Abstract

In this work, we propose a novel priors-based attention neural network (PANN) for image captioning, which incorporates two kinds of priors, i.e., the probabilities of local region proposals being mentioned (PBM priors) and part-of-speech clues for caption words (POS priors), into the visual information extraction process at each word prediction. This work is inspired by two intuitions: region proposals differ in their inherent probabilities of being mentioned in a caption, and POS clues bridge the word class (part-of-speech tag) with the categories of visual features. We propose new methods to extract these two priors: the PBM priors are obtained by computing the similarities between the caption feature vector and the local feature vectors, while the POS priors are predicted at each step of word generation by taking the hidden state of the decoder as input. These two kinds of priors are then incorporated into the PANN module of the decoder to help it extract more accurate visual information for generating the current word. In our experiments, we qualitatively analyze the proposed approach and quantitatively evaluate several captioning schemes with our PANN on the MS-COCO dataset. Experimental results show that the proposed method achieves better performance and demonstrate the effectiveness of the proposed network for image captioning.
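
To make the two priors concrete, the following is a minimal PyTorch-style sketch of how they could be computed and folded into the attention step. All module names, feature dimensions, the number of POS tags, and the way the PBM prior is fused into the attention logits are assumptions made for illustration; the paper's exact formulation is not reproduced in this excerpt.

```python
# Hypothetical sketch of the PBM and POS priors described in the abstract.
# Names, dimensions, and the fusion rule are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PriorsAttention(nn.Module):
    def __init__(self, feat_dim=2048, cap_dim=2048, hid_dim=512, num_pos_tags=15):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hid_dim)       # project local region features
        self.proj_c = nn.Linear(cap_dim, hid_dim)        # project caption feature vector
        self.pos_head = nn.Linear(hid_dim, num_pos_tags) # POS prior from decoder state
        self.att = nn.Linear(hid_dim, 1)                 # attention scorer

    def pbm_priors(self, regions, caption_feat):
        # PBM prior: similarity between the caption feature vector and each
        # local region feature, normalised over regions.
        v = F.normalize(self.proj_v(regions), dim=-1)          # (B, R, H)
        c = F.normalize(self.proj_c(caption_feat), dim=-1)     # (B, H)
        sim = torch.bmm(v, c.unsqueeze(-1)).squeeze(-1)        # (B, R)
        return F.softmax(sim, dim=-1)

    def pos_priors(self, hidden):
        # POS prior: distribution over part-of-speech tags predicted from
        # the decoder hidden state at the current step.
        return F.softmax(self.pos_head(hidden), dim=-1)        # (B, T)

    def forward(self, regions, caption_feat, hidden):
        pbm = self.pbm_priors(regions, caption_feat)           # (B, R)
        pos = self.pos_priors(hidden)                          # (B, T)
        # One plausible fusion: bias the attention logits with the PBM prior.
        logits = self.att(torch.tanh(self.proj_v(regions) + hidden.unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(logits + torch.log(pbm + 1e-8), dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)  # attended visual feature
        return context, alpha, pos
```

At each decoding step, the returned context vector would feed the word predictor, while the POS distribution could indicate which category of visual feature to emphasize, in the spirit of the abstract's description.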

Highlights

  • Image captioning aims at automatically generating a natural language description for an input image using pre-designed algorithms [1,2]; this research lies at the intersection of two currently prevalent research fields, i.e., Computer Vision (CV) and Natural Language Processing (NLP)

  • Captioning models consist of two main parts, i.e., an encoder and a decoder, in which the encoder employs convolutional neural networks (CNNs) to extract visual features as the image representation, and the decoder usually adopts a recurrent neural network (RNN) [1,3,4,5,6] or an attention-based neural network [7,8,9,10,11] to decode the visual features into word sequences of variable length (a minimal sketch of this pipeline follows this list)

  • Experiments are conducted on Microsoft COCO (MS-COCO), demonstrating that the proposed method outperforms several current state-of-the-art approaches in most metrics, and that the proposed priors-based attention neural network (PANN) is effective for image captioning
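
For readers less familiar with the encoder-decoder setup referred to in the second highlight, a bare-bones captioning skeleton (without the proposed priors) might look as follows. The ResNet-101 backbone, the dimensions, and the plain LSTM decoder are placeholder assumptions standing in for the CNN/RNN choices cited above, not the PANN itself.

```python
# Minimal encoder-decoder captioning skeleton; backbone and decoder choices
# are assumptions for illustration, not the paper's architecture.
import torch
import torch.nn as nn
import torchvision.models as models

class Captioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hid_dim=512):
        super().__init__()
        backbone = models.resnet101(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # CNN feature map
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim + 2048, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)             # (B, 2048, h, w) visual features
        global_feat = feats.mean(dim=(2, 3))     # mean-pooled image representation
        emb = self.embed(captions)               # (B, T, E) word embeddings
        vis = global_feat.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden, _ = self.decoder(torch.cat([emb, vis], dim=-1))
        return self.out(hidden)                  # per-step word logits
```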

Introduction

Image captioning aims at automatically generating a natural language description for an input image using pre-designed algorithms [1,2]; this research lies at the intersection of two currently prevalent research fields, i.e., Computer Vision (CV) and Natural Language Processing (NLP). These technologies have wide applications in daily life, such as generating descriptive captions for image retrieval and indexing, giving robots stronger human–machine communication abilities, automating video security monitoring, and helping people with visual impairments by translating visual signals into information that can be communicated through text-to-speech technology.
