Multimodal AI for Image Captioning: Using VGG16 and Attention-Enhanced LSTM Networks for Visual Description Generation
Keywords:
Image Captioning, VGG16, Attention Mechanism, LSTM Networks, Multimodal Learning, BLEU Score, Natural Language Generation, Deep Learning

Abstract
Image captioning, a pivotal task at the intersection of computer vision and natural language processing, involves generating descriptive sentences for given visual inputs. Despite significant advancements, achieving semantically rich and contextually coherent captions remains challenging due to limitations in traditional convolutional-recurrent architectures. This study aims to enhance visual description generation by proposing a multimodal AI framework that integrates VGG16-based deep feature extraction with an attention-enhanced Long Short-Term Memory (LSTM) network. The methodology employs a pre-trained VGG16 model for extracting hierarchical image features, coupled with an attention mechanism that dynamically focuses on salient regions during sequential caption generation. The Flickr8k dataset, comprising 8,092 images with five captions each, was used for model training and evaluation. Images were resized to 224×224 pixels, and captions were preprocessed with tokenization and sequence padding. The model was trained using the Adam optimizer, incorporating early stopping and adaptive learning rate strategies, and evaluated on BLEU-1, BLEU-2, BLEU-4, METEOR, and ROUGE-L metrics. Experimental results demonstrate that the proposed framework outperforms baseline VGG16-LSTM and CNN-GRU models, achieving a BLEU-4 score of 30.3%, a METEOR score of 31.2%, and a ROUGE-L score of 58.9%. Training time was reduced to three hours, with an inference speed of 0.09 seconds per image. These findings underscore the system’s potential for real-time applications such as assistive technologies and intelligent content indexing. In conclusion, by combining deep convolutional features with adaptive attention mechanisms, this study advances the state of multimodal learning and provides a scalable solution for generating accurate and contextually relevant image captions.
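The pipeline summarized above can be illustrated with a minimal sketch; the snippet below is not the authors' implementation, and the vocabulary size, caption length, and layer widths are assumed placeholders rather than values reported in this study. It wires a frozen, pre-trained VGG16 encoder (224×224 input, 7×7×512 feature map) to an LSTM decoder with additive attention over the 49 image regions, trained with the Adam optimizer.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

VOCAB_SIZE = 8000   # assumed tokenizer vocabulary size (placeholder)
MAX_LEN = 34        # assumed maximum caption length after padding (placeholder)
EMBED_DIM = 256     # assumed embedding width (placeholder)
UNITS = 256         # assumed LSTM width (placeholder)

# Encoder: pre-trained VGG16, frozen, last convolutional block as spatial features.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False
image_in = layers.Input(shape=(224, 224, 3))
feat = base(image_in)                              # (7, 7, 512) feature map
feat = layers.Reshape((49, 512))(feat)             # 49 image regions, 512-d each
feat = layers.Dense(UNITS)(feat)                   # project regions to decoder width

# Decoder: token embedding + LSTM over the (padded) caption prefix.
caption_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)
lstm_out = layers.LSTM(UNITS, return_sequences=True)(emb)

# Additive (Bahdanau-style) attention: each decoding step queries the 49 regions.
context = layers.AdditiveAttention()([lstm_out, feat])       # (MAX_LEN, UNITS)
merged = layers.Concatenate()([lstm_out, context])
out = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation="softmax"))(merged)  # next-token distribution

model = Model([image_in, caption_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy")
```

In this sketch the decoder attends over the spatial VGG16 features at every decoding step, which is one common way to realize the dynamic focus on salient regions described in the abstract; at inference time, captions would be generated token by token with greedy or beam-search decoding.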
