Multimodal AI for Image Captioning: Using VGG16 and Attention-Enhanced LSTM Networks for Visual Description Generation
Keywords:
Image Captioning, VGG16, Attention Mechanism, LSTM Networks, Multimodal Learning, BLEU Score, Natural Language Generation, Deep Learning

Abstract
Image captioning, a pivotal task at the intersection of computer vision and natural language processing, involves generating descriptive sentences for given visual inputs. Despite significant advancements, achieving semantically rich and contextually coherent captions remains challenging due to limitations in traditional convolutional-recurrent architectures. This study aims to enhance visual description generation by proposing a multimodal AI framework that integrates VGG16-based deep feature extraction with an attention-enhanced Long Short-Term Memory (LSTM) network. The methodology employs a pre-trained VGG16 model for extracting hierarchical image features, coupled with an attention mechanism that dynamically focuses on salient regions during sequential caption generation. The Flickr8k dataset, comprising 8,092 images with five captions each, was used for model training and evaluation. Images were resized to 224×224 pixels, and captions were preprocessed with tokenization and sequence padding. The model was trained using the Adam optimizer, incorporating early stopping and adaptive learning rate strategies, and evaluated on BLEU-1, BLEU-2, BLEU-4, METEOR, and ROUGE-L metrics. Experimental results demonstrate that the proposed framework outperforms baseline VGG16-LSTM and CNN-GRU models, achieving a BLEU-4 score of 30.3%, a METEOR score of 31.2%, and a ROUGE-L score of 58.9%. Training time was reduced to three hours, with an inference speed of 0.09 seconds per image. These findings underscore the system’s potential for real-time applications such as assistive technologies and intelligent content indexing. In conclusion, by combining deep convolutional features with adaptive attention mechanisms, this study advances the state of multimodal learning and provides a scalable solution for generating accurate and contextually relevant image captions.
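The pipeline summarized above can be illustrated with a minimal sketch; the snippet below is not the authors' implementation, and the vocabulary size, caption length, and layer widths are assumed placeholders rather than values reported in this study. It wires a frozen, pre-trained VGG16 encoder (224×224 input, 7×7×512 feature map) to an LSTM decoder with additive attention over the 49 image regions, trained with the Adam optimizer.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

VOCAB_SIZE = 8000   # assumed tokenizer vocabulary size (placeholder)
MAX_LEN = 34        # assumed maximum caption length after padding (placeholder)
EMBED_DIM = 256     # assumed embedding width (placeholder)
UNITS = 256         # assumed LSTM width (placeholder)

# Encoder: pre-trained VGG16, frozen, last convolutional block as spatial features.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False
image_in = layers.Input(shape=(224, 224, 3))
feat = base(image_in)                              # (7, 7, 512) feature map
feat = layers.Reshape((49, 512))(feat)             # 49 image regions, 512-d each
feat = layers.Dense(UNITS)(feat)                   # project regions to decoder width

# Decoder: token embedding + LSTM over the (padded) caption prefix.
caption_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)
lstm_out = layers.LSTM(UNITS, return_sequences=True)(emb)

# Additive (Bahdanau-style) attention: each decoding step queries the 49 regions.
context = layers.AdditiveAttention()([lstm_out, feat])       # (MAX_LEN, UNITS)
merged = layers.Concatenate()([lstm_out, context])
out = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation="softmax"))(merged)  # next-token distribution

model = Model([image_in, caption_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy")
```

In this sketch the decoder attends over the spatial VGG16 features at every decoding step, which is one common way to realize the dynamic focus on salient regions described in the abstract; at inference time, captions would be generated token by token with greedy or beam-search decoding.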
