Abstract

Language and vision are the two most essential parts of human intelligence for interpreting the real world around us. How to connect language and vision is a key question in current research. Multimodal methods such as visual semantic embedding, which unify images and their corresponding texts in a single feature space, have been widely studied recently. Inspired by recent progress in text data augmentation, in particular the simple but powerful EDA (easy data augmentation) technique, we can expand the information contained in the given data to improve model performance. In this paper, we take advantage of text data augmentation and word embedding initialization for multimodal retrieval. We utilize EDA for text data augmentation, initialize the word embeddings of a recurrent-neural-network text encoder with pretrained vectors, and minimize the gap between the two spaces with a triplet ranking loss with hard negative mining. On the two Flickr-based datasets, we achieve the same recall with only 60% of the training data as normal training with all available data. Experimental results show the improvement of our proposed model: on all datasets used in this paper (Flickr8k, Flickr30k, and MS-COCO), our model performs better on image annotation and image retrieval tasks. The experiments also demonstrate that text data augmentation is more suitable for smaller datasets, while word embedding initialization is more suitable for larger ones.
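For concreteness, the following is a minimal Python sketch of the four EDA operations (synonym replacement, random insertion, random swap, random deletion) described by the EDA authors; the tiny SYNONYMS table is a placeholder assumption, since EDA normally draws synonyms from WordNet.

```python
import random

# Placeholder synonym table (hypothetical entries); EDA normally uses WordNet.
SYNONYMS = {"dog": ["puppy", "hound"], "runs": ["sprints", "jogs"], "fast": ["quickly"]}

def synonym_replacement(words, n=1):
    out = words[:]
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])  # swap word for a synonym
    return out

def random_insertion(words, n=1):
    out = words[:]
    for _ in range(n):
        synonyms = [s for w in out for s in SYNONYMS.get(w, [])]
        if synonyms:  # insert a synonym of some word at a random position
            out.insert(random.randrange(len(out) + 1), random.choice(synonyms))
    return out

def random_swap(words, n=1):
    out = words[:]
    for _ in range(n):
        i, j = random.randrange(len(out)), random.randrange(len(out))
        out[i], out[j] = out[j], out[i]  # swap two random positions
    return out

def random_deletion(words, p=0.1):
    out = [w for w in words if random.random() > p]
    return out if out else [random.choice(words)]  # never delete everything

print(" ".join(random_swap("the dog runs fast".split())))
```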

Highlights

  • Language and vision are the two most essential parts of human intelligence for interpreting the real world around us and communicating with each other.

  • Our work is an extension of visual semantic embedding (VSE)++ [21]. The main contribution is that we improve performance through text data augmentation and a triplet ranking loss with hard negative mining, which are discussed in the sections titled “Text Data Augmentation” and “Triplet Ranking Loss with Hard Negative Mining” (a minimal sketch of this loss follows these highlights).

  • We use our own implementation of VSE++ [21] as the baseline and compare the results with the models mentioned in the section titled “Dual-Normalized Visual Semantic Embedding Learning.”
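As a concrete illustration of the triplet ranking loss with hard negative mining, here is a minimal PyTorch sketch of the max-of-hinges formulation popularized by VSE++ [21]; the margin value and batch-level mining are illustrative assumptions rather than the authors' exact settings.

```python
import torch

def triplet_loss_hard_negatives(img_emb, txt_emb, margin=0.2):
    """Max-of-hinges triplet ranking loss over a batch of L2-normalized embeddings.

    img_emb, txt_emb: (batch, dim) tensors; row i of each forms a matched pair.
    """
    scores = img_emb @ txt_emb.t()             # cosine similarities (batch x batch)
    pos = scores.diag().view(-1, 1)            # similarity of each true pair
    cost_txt = (margin + scores - pos).clamp(min=0)      # rank captions per image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # rank images per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)   # zero out the positives themselves
    cost_img = cost_img.masked_fill(mask, 0)
    # hard negative mining: only the hardest in-batch negative contributes
    return cost_txt.max(dim=1).values.sum() + cost_img.max(dim=0).values.sum()
```

In such a batch, every off-diagonal entry of the score matrix is a negative, and only the hardest one per row and per column contributes to the loss.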


Summary

Introduction

Language and vision are the two most essential parts of human intelligence for interpreting the real world around us and communicating with each other. Uniting these two systems will be important in research on both human intelligence and artificial intelligence. With the rapid development of machine learning (ML), and especially deep learning (DL) [1], we have seen breakthroughs in language and vision processing, both separately and jointly. Visual semantic embedding (VSE) has been proposed to tackle this problem. In visual semantic research, datasets usually provide an image together with a corresponding description, which may be a single word, a phrase, or a sentence. This allows us to unify the image representation [2] and word representation/embedding [3] in the same feature space. Visual semantic embedding learns a representation that maps semantically associated image-text pairs into the same space; that is, it learns a common feature space that captures the underlying domain structure, in which the embeddings of images and text are semantically meaningful.
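To make the text side of this joint space concrete, below is a minimal sketch (not the authors' code) of a recurrent caption encoder whose embedding layer can be initialized from pretrained word vectors such as GloVe or word2vec; the GRU choice, the dimensions, and the `pretrained` matrix are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """GRU caption encoder with optional word embedding initialization."""

    def __init__(self, vocab_size, word_dim, embed_dim, pretrained=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        if pretrained is not None:
            # copy pretrained vectors (vocab_size x word_dim) into the table
            self.embed.weight.data.copy_(torch.as_tensor(pretrained, dtype=torch.float))
        self.rnn = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, tokens, lengths):
        # tokens: (batch, max_len) word ids; lengths: (batch,) LongTensor of true lengths
        out, _ = self.rnn(self.embed(tokens))
        idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(2))
        feat = out.gather(1, idx).squeeze(1)   # hidden state at each last real token
        return F.normalize(feat, dim=1)        # L2-normalize for cosine similarity
```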
