Efficient Channel Attention Based Encoder–Decoder Approach for Image Captioning in Hindi

Abstract

Image captioning refers to the process of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism, together with Bahdanau attention and a gated recurrent unit, for developing a Hindi image captioning model. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on an image’s important channels while performing convolution, essentially assigning higher importance to some channels than to others, and has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi is the fourth most spoken language globally, widely spoken in India and South Asia, and is India’s official language. A dataset for image captioning in Hindi was created manually by translating the well-known MSCOCO dataset from English to Hindi. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results show that the proposed method outperforms the baselines, attaining improvements of 0.59%, 2.51%, 4.38%, and 3.30% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, over the state of the art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the proposed method’s efficacy.
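
For concreteness, here is a minimal PyTorch sketch of an ECA-style channel attention block of the kind the abstract builds on, following the adaptive kernel-size rule from the ECA-Net paper; the class name and default hyperparameters are illustrative, not taken from this article.

```python
import math
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient Channel Attention (ECA-Net style) sketch.

    Channel weights come from a 1-D convolution over the globally
    pooled channel descriptor; the kernel size k is chosen
    adaptively from the channel count C.
    """
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1          # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> channel descriptor (B, 1, C)
        y = self.pool(x).squeeze(-1).transpose(-1, -2)
        y = self.sigmoid(self.conv(y)).transpose(-1, -2).unsqueeze(-1)
        return x * y                        # re-weight the channels
```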

Similar Papers
  • Conference Article
  • Cited by 10
  • 10.1109/smc52423.2021.9658859
An Information Multiplexed Encoder-Decoder Network for Image Captioning in Hindi
  • Oct 17, 2021
  • Santosh Kumar Mishra + 3 more

Image captioning is a multi-modal problem linking computer vision and natural language processing, combining the challenges of image analysis and text generation. In the literature, most image captioning work has been carried out in the English language only. This paper proposes a new approach for image captioning in the Hindi language using a deep learning-based encoder-decoder architecture. Hindi, widely spoken in India and South Asia, is the fourth most spoken language globally; it is India’s official language. In recent years, significant advances in image captioning have come from encoder-decoder architectures based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs): the encoder CNN extracts features from input images, while the decoder RNN performs language modeling. The proposed encoder-decoder architecture uses information multiplexing in the encoder CNN to achieve a performance gain in feature extraction. Extensive experiments on the benchmark MSCOCO Hindi dataset show significant improvements in BLEU score over the baselines, and manual human evaluation of the generated captions in terms of adequacy and fluency further establishes the proposed method’s efficacy in generating good-quality captions.
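
As a point of reference, below is a minimal PyTorch sketch of the standard CNN-encoder/RNN-decoder captioning backbone the abstract describes. The paper’s information-multiplexing encoder is not reproduced here; a plain ResNet-50 encoder and a GRU decoder stand in, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """Generic CNN-encoder / RNN-decoder captioner (sketch only)."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        cnn = models.resnet50(weights=None)        # stand-in backbone encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.fc = nn.Linear(2048, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.fc(self.encoder(images).flatten(1))       # (B, E)
        # Image features seed the sequence; shifted caption tokens follow.
        inputs = torch.cat([feats.unsqueeze(1),
                            self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.gru(inputs)
        return self.out(hidden)                                # word logits
```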

  • Conference Article
  • Cited by 8
  • 10.1109/icaiic57133.2023.10067039
Captioning Remote Sensing Images Using Transformer Architecture
  • Feb 20, 2023
  • Wrucha Nanal + 1 more

Image captioning aspires to describe images with machines, combining the Computer Vision (CV) and Natural Language Processing (NLP) fields. The current state of the art for image captioning uses attention-based encoder-decoder models. An attention-based model uses an ‘attention mechanism’ that focuses on a particular section of the image to generate the corresponding caption word, and the NLP side of such a model uses Long Short-Term Memory (LSTM) networks for word generation. Attention-based models did not emphasize the relative arrangement of words in a caption, thereby ignoring the context of the sentence. Inspired by the versatility of Transformers in NLP, this work utilises the Transformer architecture for the image captioning use case. It also makes use of a pretrained Bidirectional Encoder Representations from Transformers (BERT) model, which generates a contextually rich embedding of a caption. The multi-head attention of the Transformer establishes a strong correlation between the image and the contextually aware caption. The experiment is performed on the Remote Sensing Image Captioning Dataset, and the model is evaluated using NLP metrics such as Bilingual Evaluation Understudy 1–4 (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE). The proposed model shows better results for a few of the metrics.
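
To illustrate the core idea, the following PyTorch sketch shows cross-attention in which BERT caption embeddings act as queries over image-region features; the dimensions, sequence lengths, and tensor shapes are assumptions for illustration, not the paper’s configuration.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 49 image-region features and a BERT-embedded
# caption of 20 tokens, both projected to a 768-d model dimension.
d_model, n_heads = 768, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

image_feats = torch.randn(1, 49, d_model)   # encoder output (e.g. CNN grid)
caption_emb = torch.randn(1, 20, d_model)   # contextual BERT embeddings

# Queries come from the caption, keys/values from the image, so every
# generated word can attend to the image regions that support it.
attended, weights = cross_attn(query=caption_emb,
                               key=image_feats,
                               value=image_feats)
print(attended.shape, weights.shape)        # (1, 20, 768) (1, 20, 49)
```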

  • Research Article
  • Cited by 5
  • 10.22219/kinetik.v7i4.1568
Image Captioning using Hybrid of VGG16 and Bidirectional LSTM Model
  • Nov 10, 2022
  • Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control
  • Yufis Azhar + 3 more

Image captioning is one of the biggest challenges at the intersection of computer vision and natural language processing. Many studies have addressed image captioning, but their evaluation results remain low; this study therefore focuses on improving on the results of previous work. We use the Flickr8k dataset and the VGG16 Convolutional Neural Network (CNN) as an encoder to extract features from images, and a Recurrent Neural Network (RNN) using the Bidirectional Long Short-Term Memory (BiLSTM) method as the decoder. The extracted image feature vectors are passed to the BiLSTM to produce descriptions that match the input image or visual content; the captions provide information on an object’s name, location, color, size, and features, as well as its surroundings. A greedy search algorithm with the argmax function and a beam search algorithm are used for caption generation, evaluated with Bilingual Evaluation Understudy (BLEU) scores. The best result in this study is obtained by the VGG16 model with Bidirectional LSTM using beam search with parameter K = 3, achieving a BLEU-1 score of 0.60593, which is superior to previous studies.
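
The beam search decoding the study uses (with K = 3) can be sketched generically as below; step_fn is a hypothetical stand-in for the VGG16+BiLSTM decoder’s next-word distribution, not an interface from the paper.

```python
import heapq

def beam_search(step_fn, start_token, end_token, k=3, max_len=20):
    """Generic beam search sketch (K = 3 as in the study).

    step_fn(sequence) must return an iterable of (token, log_prob)
    pairs for the next position.
    """
    beams = [(0.0, [start_token])]           # (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:         # keep finished hypotheses
                candidates.append((score, seq))
                continue
            for tok, logp in step_fn(seq):
                candidates.append((score + logp, seq + [tok]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]
```

Greedy search is the K = 1 special case: at each step only the argmax token survives, which is cheaper but often yields lower BLEU than keeping K hypotheses alive.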

  • Research Article
  • Cited by 30
  • 10.13053/cys-23-3-3269
A Deep Attention based Framework for Image Caption Generation in Hindi Language
  • Oct 7, 2019
  • Computación y Sistemas
  • Rijul Dhir + 3 more

Image captioning refers to the process of generating a textual description for an image that defines the objects and activities within it. It lies at the intersection of computer vision and natural language processing, where computer vision is used to understand the content of an image and language modelling from natural language processing is used to convert the image into words in the right order. A large number of works exist for image captioning in the English language, but no work existed for image captioning in the Hindi language. Hindi is the official language of India, and it is the fourth most-spoken language in the world, after Mandarin, Spanish, and English. The current paper attempts to bridge this gap. A novel attention-based architecture for image captioning in the Hindi language is proposed: a convolutional neural network is used as an encoder to extract features from an input image, and a gated recurrent unit based neural network is used as a decoder to perform language modelling at the word level. In between, an attention mechanism helps the decoder look at the important portions of the image. To show the efficacy of the proposed model, we first created a manually annotated image captioning training corpus in Hindi corresponding to the popular MS COCO English dataset, covering around 80,000 images. Experimental results show that the proposed model attains a BLEU-1 score of 0.5706 on this dataset.
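
A minimal PyTorch sketch of the Bahdanau-style additive attention such an encoder-decoder typically places between the CNN features and the GRU decoder is given below; the layer names and attention dimension are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(s, h_i) = v^T tanh(W1 h_i + W2 s)."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, attn_dim)
        self.w2 = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, dec_state):
        # features: (B, N, feat_dim) image regions; dec_state: (B, hidden_dim)
        scores = self.v(torch.tanh(self.w1(features) +
                                   self.w2(dec_state).unsqueeze(1)))  # (B, N, 1)
        alpha = torch.softmax(scores, dim=1)          # attention weights
        context = (alpha * features).sum(dim=1)       # (B, feat_dim) context
        return context, alpha.squeeze(-1)
```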

  • Research Article
  • Cited by 28
  • 10.3390/s23083835
EFFNet-CA: An Efficient Driver Distraction Detection Based on Multiscale Features Extractions and Channel Attention Mechanism
  • Apr 8, 2023
  • Sensors (Basel, Switzerland)
  • Taimoor Khan + 2 more

Driver distraction is considered a main cause of road accidents; every year, thousands of people sustain serious injuries, and many lose their lives. Road accidents continue to increase due to driver distractions such as talking, drinking, and using electronic devices. Several researchers have developed deep learning techniques for detecting driver activity, but current approaches need further improvement because of the high number of false predictions in real time. To cope with these issues, it is important to develop an effective technique that detects driver behavior in real time to protect human lives and property. In this work, we develop a convolutional neural network (CNN)-based technique that integrates a channel attention (CA) mechanism for efficient and effective detection of driver behavior. We compare the proposed model with various backbone models, with and without CA integration: VGG16, VGG16+CA, ResNet50, ResNet50+CA, Xception, Xception+CA, InceptionV3, InceptionV3+CA, and EfficientNetB0. The proposed model obtains optimal performance in terms of accuracy, precision, recall, and F1-score on two well-known datasets, AUC Distracted Driver (AUCD2) and State Farm Distracted Driver Detection (SFD3), achieving 99.58% accuracy on SFD3 and 98.97% accuracy on AUCD2.
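
As one common formulation of such a CA block, here is a squeeze-and-excitation-style channel attention module in PyTorch that can be appended to any of the listed backbones; the paper’s exact block may differ, and the reduction ratio is an assumed default.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention (one common
    CA formulation; illustrative, not the paper's exact block)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # squeeze: global context
        self.fc = nn.Sequential(                  # excite: per-channel gate
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # re-weight backbone features
```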

  • Research Article
  • Cited by 42
  • 10.1145/3432246
A Hindi Image Caption Generation Framework Using Deep Learning
  • Mar 15, 2021
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Santosh Kumar Mishra + 3 more

Image captioning is the process of generating a textual description of an image that aims to describe its salient parts. It is an important problem involving computer vision, used for understanding images, and natural language processing, used for language modeling. A lot of work has been done on image captioning for the English language; in this article, we develop a model for image captioning in the Hindi language. Hindi is the official language of India and the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in the Hindi language. A dataset is manually created by translating the well-known MSCOCO dataset from English to Hindi. Different types of attention-based architectures are then developed for image captioning in the Hindi language; these attention mechanisms have never before been used for Hindi. The results of the proposed model are compared with several baselines in terms of BLEU scores and show that our model performs better than the others. Manual evaluation of the obtained captions in terms of adequacy and fluency also reveals the effectiveness of the proposed approach. Availability of resources: the code is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language; the dataset will be made available at http://www.iitp.ac.in/~ai-nlp-ml/resources.html.
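
BLEU evaluation of generated Hindi captions can be reproduced with standard tooling; the sketch below uses NLTK’s corpus_bleu with placeholder tokenized captions, not data from the paper.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Tokenized Hindi captions: each hypothesis is scored against the
# image's reference captions (the strings below are placeholders).
references = [[["एक", "आदमी", "घोड़े", "पर", "सवार", "है"]]]
hypotheses = [["एक", "आदमी", "घोड़े", "पर", "है"]]

smooth = SmoothingFunction().method1
for n, weights in enumerate([(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                             (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)], 1):
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")   # BLEU-1 through BLEU-4
```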

  • Conference Article
  • Cited by 11
  • 10.1109/esci50559.2021.9396839
Image Captioning Methods and Metrics
  • Mar 5, 2021
  • Omkar Sargar + 1 more

Image captioning is one of the emerging research topics in the field of AI. It uses a combination of Computer Vision (CV) and Natural Language Processing (NLP) to derive features from an image, identify objects, actions, and their relationships, and generate a description of the image. It is an important concept in artificial intelligence, applied in fields such as aids for the blind, self-driving cars, and many more. This paper presents a concise overview of the state of the art in image captioning and its methods for caption generation using deep learning. We also describe an approach to image caption generation using Convolutional Neural Network (CNN) and Generative Adversarial Network (GAN) models in a deep learning framework, making the system capable of creating sentences for images. It uses an encoder-decoder architecture, where a CNN is used for image vector generation and an LSTM is used for generating a logical sentence using NLP concepts. Finally, we experimentally compare the proposed system with numerous existing systems and show its effectiveness.

  • Conference Article
  • Cited by 3
  • 10.1109/hora55278.2022.9799958
Novel Image Caption System Using Deep Convolutional Neural Networks (VGG16)
  • Jun 9, 2022
  • Alaa Sabeeh Salim + 5 more

With advances in artificial intelligence and computer vision, image captioning (IC) has progressively attracted researchers’ attention. IC automatically generates natural-language text descriptions according to image content, combining knowledge from computer vision and natural language processing. In this article, a novel image captioning system was developed and validated on the Flickr8k dataset. The designed system consists of Long Short-Term Memory (LSTM) and VGG16 Convolutional Neural Network (CNN) components. The main improvements lie in the structure of the designed system, achieved by adapting the batch size and by studying deep learning parameters such as regularization terms added to the loss function, CNN optimizers, and dropout layers. The results show the effectiveness of the designed system. Finally, the article highlights some open challenges in the image description task.
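
A hedged Keras sketch of a VGG16-feature + LSTM captioner exposing the dropout and L2-regularization knobs the abstract studies is shown below; the merge-style layout, layer sizes, and vocabulary size are assumptions for illustration, not the paper’s configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Illustrative merge-style captioner: a VGG16-feature branch and an
# LSTM text branch are combined before predicting the next word.
vocab_size, max_len, l2 = 5000, 34, regularizers.l2(1e-4)

img_in = layers.Input(shape=(4096,))            # precomputed VGG16 fc features
img = layers.Dropout(0.5)(img_in)               # dropout knob
img = layers.Dense(256, activation="relu", kernel_regularizer=l2)(img)

txt_in = layers.Input(shape=(max_len,))
txt = layers.Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt = layers.Dropout(0.5)(txt)
txt = layers.LSTM(256)(txt)

merged = layers.add([img, txt])
out = layers.Dense(vocab_size, activation="softmax")(
    layers.Dense(256, activation="relu")(merged))

model = tf.keras.Model([img_in, txt_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

Batch size, the regularization strength, and the choice of optimizer are then the tuning parameters varied in training runs.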

  • Research Article
  • Cited by 8
  • 10.14569/ijacsa.2023.0140326
An Efficient Deep Learning based Hybrid Model for Image Caption Generation
  • Jan 1, 2023
  • International Journal of Advanced Computer Science and Applications
  • Mehzabeen Kaur + 1 more

In recent years, with the increasing use of different social media platforms, image captioning plays a major role in automatically describing a whole image in a natural language sentence. Image captioning is the process of automatically generating a natural-language textual description of an image using artificial intelligence techniques; computer vision and natural language processing are its key components. A Convolutional Neural Network (CNN), part of computer vision, is used for object detection and feature extraction, while Natural Language Processing (NLP) techniques help generate the textual caption of the image. Generating a suitable image description by machine is a challenging task, as it depends on detecting objects, their locations, and their semantic relationships, expressed in a human-understandable language such as English. In this paper, our aim is to develop an encoder-decoder based hybrid image captioning approach using VGG16, ResNet50, and YOLO. VGG16 and ResNet50 are pre-trained feature extraction models trained on millions of images, and YOLO is used for real-time object detection. The approach first extracts image features using VGG16, ResNet50, and YOLO and concatenates the results into a single representation. Finally, LSTM and BiGRU are used to generate the textual description of the image. The proposed model is evaluated using BLEU, METEOR, and ROUGE scores.
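
The feature-fusion step the abstract describes amounts to concatenating the three extractors’ vectors before decoding; a minimal PyTorch sketch follows, with purely illustrative feature dimensions.

```python
import torch

# Hypothetical per-image feature vectors from the three extractors;
# the dimensions are illustrative, not taken from the paper.
vgg_feats = torch.randn(1, 4096)      # VGG16 fully-connected features
resnet_feats = torch.randn(1, 2048)   # ResNet50 pooled features
yolo_feats = torch.randn(1, 512)      # encoded YOLO detections

fused = torch.cat([vgg_feats, resnet_feats, yolo_feats], dim=1)
print(fused.shape)   # torch.Size([1, 6656]) -> fed to the LSTM/BiGRU decoder
```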

  • Research Article
  • Cited by 60
  • 10.1145/3009906
Computer Vision and Natural Language Processing
  • Dec 12, 2016
  • ACM Computing Surveys
  • Peratham Wiriyathammabhum + 3 more

Integrating computer vision and natural language processing is a novel interdisciplinary field that has received a lot of attention recently. In this survey, we provide a comprehensive introduction to the integration of computer vision and natural language processing in multimedia and robotics applications, with more than 200 key references. The tasks we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies for integrating computer vision and natural language processing models under the unified theme of distributional semantics, drawing an analogy between image embedding in computer vision and word embedding in natural language processing. Finally, we present a unified view of the field and propose possible future directions.

  • Conference Article
  • Cited by 3
  • 10.1109/iceca55336.2022.10009435
A Comparative Study on Optimizers for Automatic Image Captioning
  • Dec 1, 2022
  • Eliyah Immanuel Thavaraj A + 2 more

In the field of artificial intelligence, computer vision and natural language processing are used to automatically describe an image’s contents. A generative neural model, drawing on machine translation and computer vision, is developed to produce natural phrases that explain the image. The architecture includes recurrent neural networks (RNNs) and convolutional neural networks (CNNs): the RNN is used to create phrases, whereas the CNN is used to extract characteristics from images. The model is trained to produce captions that, given an input image, describe it almost exactly. The outcome of these algorithms is determined by several factors, including feature extraction, caption generation, and optimizer selection. Our goal is to conduct a comparative analysis of several optimizers to determine which achieves the highest accuracy for a deep learning model. The deep learning model is trained with various optimizers on the Flickr dataset. The resulting accuracies are: RMSprop 92%, SGD 12%, Adam 53%, and Adadelta 12%.
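
A comparison of this kind can be set up by retraining the same model under each optimizer; the Keras sketch below uses a placeholder classification head rather than the paper’s CNN+RNN captioner, and all layer sizes are assumptions.

```python
import tensorflow as tf

def build_model():
    # Placeholder head; the real model would be the CNN+RNN captioner.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(2048,)),
        tf.keras.layers.Dense(5000, activation="softmax")])

optimizers = {
    "RMSprop": tf.keras.optimizers.RMSprop(),
    "SGD": tf.keras.optimizers.SGD(),
    "Adam": tf.keras.optimizers.Adam(),
    "Adadelta": tf.keras.optimizers.Adadelta(),
}

for name, opt in optimizers.items():
    model = build_model()                      # fresh weights per optimizer
    model.compile(optimizer=opt, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=..., validation_data=...)
    print(f"{name}: compiled and ready to train")
```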

  • Research Article
  • Cited by 119
  • 10.1155/2020/3062706
An Overview of Image Caption Generation Methods
  • Jan 9, 2020
  • Computational Intelligence and Neuroscience
  • Haoran Wang + 2 more

In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field and has become an interesting and arduous task. Image captioning, automatically generating natural language descriptions according to the content observed in an image, is an important part of scene understanding, combining knowledge of computer vision and natural language processing. Its applications are extensive and significant, for example in human-computer interaction. This paper summarizes the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation tasks. The advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are provided. Finally, the paper highlights some open challenges in the image captioning task.

  • Book Chapter
  • Cited by 2
  • 10.1007/978-3-030-60633-6_18
Lightweight Image Super-resolution with Local Attention Enhancement
  • Jan 1, 2020
  • Yunchu Yang + 3 more

In recent years, methods based on convolutional neural networks (CNNs) have been the mainstream in single image super-resolution (SISR). Although these methods achieve excellent performance, their massive number of parameters and heavy computation limit their application. The channel attention (CA) mechanism, which can enhance network performance, has also been widely used in the SR task recently; however, it was introduced from high-level vision tasks, and its original design does not consider the specificity of the SR task. To address these issues, we propose a lightweight expansion and distillation residual network (EDRN) for image super-resolution. Through diverse use of different feature channels and different convolution kernel sizes, our network effectively reduces the number of parameters while achieving superior performance. To further explore the potential of channel-wise attention in the SR task, we develop a novel plug-and-play local channel attention enhancement strategy (LCAES) that lets the network better exploit the local features of the image. Comprehensive quantitative and qualitative evaluations demonstrate that the proposed method performs favorably against state-of-the-art SR algorithms in terms of visual quality, reconstruction accuracy, and parameter count.

  • Conference Article
  • Cited by 2
  • 10.1117/12.2599421
Cross-layer channel attention mechanism for convolutional neural networks
  • Jun 30, 2021
  • Ying He + 2 more

Recently, channel attention mechanisms have been widely used to improve the performance of convolutional neural networks. However, most channel attention mechanisms applied to backbone convolutional neural networks in computer vision use the globally pooled features of each block’s output to obtain the attention weights of the corresponding channels, ignoring the spatial information of the original features and the potential relationships between adjacent layers. To address this insufficient use of spatial information and the inability to adaptively learn the potential associations of all features in a block before producing channel attention weights, we propose a new Cross-layer Channel Attention Mechanism (CCAM), in which a matrix with spatial information replaces the global pooling operation; it takes the input and output features of each block as inputs and simultaneously outputs the channel attention weights of the corresponding features. Compared with other attention mechanisms, CCAM has three advantages: first, it makes full use of the spatial information of each layer of features; second, it encourages feature reuse and fusion; third, it is better at discovering the potential relationships between the features of different layers in a block. Our simulation results demonstrate that CCAM can effectively extract the attention weights of different layers and achieve better performance on CIFAR-10, CIFAR-100, ImageNet-1K, MS COCO detection, and VOC detection, with small additional computational cost compared with the corresponding convolutional neural networks.

  • Conference Article
  • Cited by 10
  • 10.1109/rteict.2017.8256949
Effect of image colourspace on performance of convolution neural networks
  • May 1, 2017
  • K Sumanth Reddy + 2 more

Recently, the term deep learning has been creating a lot of interest in the fields of artificial intelligence, computer vision, and natural language processing. Convolutional Neural Networks (CNNs) in particular are giving state-of-the-art results in image recognition, scene understanding, object detection, image description, and related tasks. Generally, CNNs process images in the RGB colourspace even though many other colourspaces are available. In this paper, we study the effect of image colourspace on the performance of CNN models in recognizing the objects present in an image. We evaluate this on the CIFAR10 dataset by converting all the original RGB images into four other colourspaces: HLS, HSV, LUV, and YUV. To compare results, we trained AlexNet with a fixed set of parameters on all five colourspaces, including RGB. We observed that LUV is the best alternative to RGB for CNN models, with almost equal performance on the CIFAR10 test set, while YUV is the worst colourspace to use with CNN models.
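
The colourspace conversions the paper evaluates are available directly in OpenCV; the sketch below converts one RGB image (CIFAR-10-sized, random placeholder data) into the four alternatives before it would be fed to the same CNN.

```python
import cv2
import numpy as np

# One CIFAR-10-sized RGB image (placeholder pixel data).
rgb = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)

# The four alternative colourspaces tested against RGB.
variants = {
    "HLS": cv2.cvtColor(rgb, cv2.COLOR_RGB2HLS),
    "HSV": cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV),
    "LUV": cv2.cvtColor(rgb, cv2.COLOR_RGB2LUV),
    "YUV": cv2.cvtColor(rgb, cv2.COLOR_RGB2YUV),
}
for name, img in variants.items():
    print(name, img.shape, img.dtype)   # same shape, reinterpreted channels
```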
