Text present in an image carries rich semantic information that is crucial for understanding the image. For example, a signboard reading "deep water" conveys the danger present in the scene. Current image captioning models do not exploit this semantic information effectively because of their limited ability to represent scene-text tokens. Our work presents a novel image captioning model, RelNet-MAM, which combines a multilevel attention mechanism with a relation network. To improve the appearance feature representation, RelNet-MAM employs multilevel attention consisting of spatial attention, channel-wise attention, and semantic attention. To represent each scene-text token effectively, RelNet-MAM combines its appearance, FastText, location, and PHOC features. Further, the proposed RelNet-MAM uses the relation network to model the relationships between objects and scene-text tokens. Finally, a transformer model together with a dynamic pointer network serves as the decoder during caption generation. The proposed RelNet-MAM model outperforms state-of-the-art models on the TextCaps, Flickr30k, and MS COCO datasets. TextCaps requires models to read and reason about the text in an image to generate captions, while MS COCO and Flickr30k contain diverse images: persons, animals, automobiles, and indoor and outdoor scenes. Remarkably, the proposed RelNet-MAM model surpasses the current best model on the TextCaps dataset by 2.3% on B-4, 1.8% on METEOR, 2.2% on ROUGE-L, 2.0% on CIDEr-D, and 3.0% on SPICE.
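The spatial and channel-wise branches of the multilevel attention described above can be sketched roughly as follows. The shapes, the dot-product scoring, and the additive fusion of the two branches are illustrative assumptions, not the paper's exact formulation; `query` stands in for a hypothetical decoder state.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feats, query):
    # feats: (N, D) appearance features of N image regions.
    # query: (D,) hypothetical decoder state.
    scores = feats @ query / np.sqrt(feats.shape[1])  # (N,) region scores
    alpha = softmax(scores)                           # weights over regions
    return alpha @ feats                              # (D,) attended feature

def channel_attention(feats, query):
    # Attend over the D feature channels instead of the N regions.
    channel_summary = feats.mean(axis=0)              # (D,) per-channel mean
    beta = softmax(channel_summary * query)           # (D,) channel weights
    return beta * channel_summary                     # re-weighted channels

N, D = 36, 64                     # e.g. 36 region proposals, 64-dim features
feats = rng.normal(size=(N, D))
query = rng.normal(size=(D,))

spat = spatial_attention(feats, query)
chan = channel_attention(feats, query)
fused = spat + chan               # one simple way to fuse the two levels
print(fused.shape)                # (64,)
```

In this sketch each attention level produces a D-dimensional vector, so the two levels can be fused by simple addition; a semantic-attention branch over attribute embeddings could be combined the same way.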