Abstract

Combining visual recognition with language understanding aims to build a shared space between the heterogeneous data of vision and text, as in the tasks of image captioning and visual question answering (VQA). Most existing approaches convert an image into a semantic visual feature vector via deep convolutional neural networks (CNNs), while keeping the sequential property of text data and representing it with recurrent neural networks (RNNs). The key to analysing multi-source heterogeneous data is to construct the inherent correlations between the data. To reduce the heterogeneity gap between vision and language, in this work we represent the image sequentially, just as the text is. We use the objects in the visual scene to convert the image into a sequence of detected objects and their locations. We then treat a sequence of objects (a visual language) as analogous to a sequence of words (natural language). We take the order of objects into account and evaluate different permutations and combinations of objects. Experimental results on image captioning and VQA benchmarks support our hypothesis that arranging the object sequence appropriately is beneficial for Vision-to-Language (V2L) problems.
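As a rough illustration of the idea of turning an image into an ordered sequence of detected objects, the sketch below sorts a handful of detections under three candidate orderings. It is a minimal sketch under stated assumptions, not the method evaluated in this work: the `Detection` record, the `order_detections` helper, the three ordering rules (confidence, box area, left-to-right position), and the example boxes and scores are all hypothetical and chosen only to show that different arrangements yield different object sequences.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a label, a box, and a confidence score."""
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    score: float


def box_area(d: Detection) -> float:
    """Area of the bounding box, clamped at zero for degenerate boxes."""
    x0, y0, x1, y1 = d.box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


# Illustrative ordering rules (assumptions, not taken from the paper).
# Each maps a detection to a sort key; smaller keys come first.
ORDERINGS: Dict[str, Callable[[Detection], float]] = {
    "confidence": lambda d: -d.score,                         # most confident first
    "area": lambda d: -box_area(d),                           # largest object first
    "left_to_right": lambda d: (d.box[0] + d.box[2]) / 2.0,   # by horizontal centre
}


def order_detections(detections: List[Detection], ordering: str) -> List[Detection]:
    """Return the detections arranged as a sequence under the named ordering."""
    return sorted(detections, key=ORDERINGS[ordering])


# Made-up detections for a single image.
detections = [
    Detection("dog", (100, 60, 340, 280), 0.92),
    Detection("frisbee", (40, 30, 90, 70), 0.75),
    Detection("person", (280, 10, 420, 300), 0.98),
]

for name in ORDERINGS:
    sequence = [(d.label, d.box) for d in order_detections(detections, name)]
    print(name, [label for label, _ in sequence])
```

Each resulting sequence of labels and locations could then be fed to a sequence model in the same way a sentence of words would be, which is what makes the choice of ordering something that can be compared experimentally.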
