Abstract

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence-level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep neural networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence-level semantics. Importantly, this result shows that we do not need prior knowledge of lexical-level semantics in order to model sentence-level semantics. These findings demonstrate the importance of visual information in semantics.
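
To make the retrieval objective concrete, the sketch below shows one common way to train such a joint embedding space: each modality is projected into a shared space and a bidirectional max-margin loss pushes matching image-caption pairs above mismatched in-batch pairs. This is a minimal illustration rather than the authors' exact architecture; the feature dimensions, projection layers and margin value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Projects image features and caption encodings into a shared space."""
    def __init__(self, img_dim=2048, cap_dim=1024, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)  # image branch
        self.cap_proj = nn.Linear(cap_dim, emb_dim)  # caption branch

    def forward(self, img_feats, cap_feats):
        # L2-normalise so cosine similarity reduces to a dot product
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        cap = F.normalize(self.cap_proj(cap_feats), dim=-1)
        return img, cap

def hinge_retrieval_loss(img, cap, margin=0.2):
    """Bidirectional max-margin loss over in-batch negatives."""
    scores = img @ cap.t()            # B x B cosine similarity matrix
    pos = scores.diag().unsqueeze(1)  # similarities of the matching pairs
    cost_cap = (margin + scores - pos).clamp(min=0)      # image -> caption
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap[mask] = 0  # do not penalise the positive pairs themselves
    cost_img[mask] = 0
    return cost_cap.mean() + cost_img.mean()
```

At test time, ranking all captions by cosine similarity to an image (or vice versa) yields the retrieval scores reported on MSCOCO and Flickr8k, and the caption branch on its own serves as the visually grounded sentence encoder evaluated on STS.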

Highlights

  • Distributional semantics, the idea that words that occur in similar contexts have similar meanings, has been around for quite a while (e.g. Rubenstein and Goodenough 1965; Deerwester et al. 1990)

  • We found that for Flickr8k an upper bound on the learning rate of 10⁻³ and a lower bound of 10⁻⁶ worked well, while for Microsoft Common Objects in Context (MSCOCO) we had to lower the upper bound to 10⁻⁴ (see the scheduler sketch after this list)

  • We investigated whether sentence semantics can be captured in sentence embeddings without using lexical knowledge
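
The learning-rate bounds in the second highlight suggest a cyclic schedule that oscillates between a lower and an upper bound. Below is a minimal sketch using PyTorch's built-in CyclicLR; the optimiser, the stand-in model and the half-cycle length are assumptions, and only the bounds themselves come from the highlight.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the real encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

# Bounds from the highlight: 1e-6 to 1e-3 for Flickr8k;
# for MSCOCO the upper bound is lowered to 1e-4.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-6,          # lower bound
    max_lr=1e-3,           # upper bound (1e-4 for MSCOCO)
    step_size_up=2000,     # assumed half-cycle length, in batches
    cycle_momentum=False,  # Adam exposes no momentum parameter to cycle
)

for _ in range(10):   # skeleton of the training loop
    optimizer.step()  # loss.backward() would precede this in real training
    scheduler.step()
```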


Introduction

Distributional semantics, the idea that words that occur in similar contexts have similar meanings, has been around for quite a while (e.g. Rubenstein and Goodenough 1965; Deerwester et al. 1990). Rubenstein and Goodenough (1965) already studied ‘how the proportion of words common to contexts containing word A and to contexts containing word B was related to the degree to which A and B were similar in meaning’. With the success of word embeddings, researchers are looking for ways to capture the meaning of larger spans of text, such as sentences, paragraphs and even entire documents. Much less is known about how to approach this problem, and early solutions tried to adapt word embedding methods to larger spans of text, for example Skip-Thought sentence embeddings (Kiros et al. 2015), FastSent (Hill, Cho, and Korhonen 2016) and Paragraph-Vector (Le and Mikolov 2014), which are related to the Skip-Gram word model by Mikolov et al. (2013). There have been successful sentence encoder models which
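
To make Rubenstein and Goodenough's context-overlap measure concrete, the toy sketch below scores two words by the proportion of context words they share. The whitespace tokenisation, the sentence-level notion of ‘context’ and the two-sentence corpus are all simplifying assumptions.

```python
def context_overlap(sentences, word_a, word_b):
    """Proportion of context words shared by word_a and word_b (Jaccard)."""
    ctx_a, ctx_b = set(), set()
    for sent in sentences:
        tokens = sent.lower().split()
        if word_a in tokens:
            ctx_a.update(t for t in tokens if t != word_a)
        if word_b in tokens:
            ctx_b.update(t for t in tokens if t != word_b)
    if not ctx_a or not ctx_b:
        return 0.0
    return len(ctx_a & ctx_b) / len(ctx_a | ctx_b)

corpus = ["the cat sat on the mat", "the dog sat on the mat"]
print(context_overlap(corpus, "cat", "dog"))  # 1.0: identical contexts
```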
