Abstract

Several recent studies have shown the benefits of combining language and perception to infer word embeddings. These multimodal approaches either simply combine pre-trained textual and visual representations (e.g. features extracted from convolutional neural networks), or use the latter to bias the learning of textual word embeddings. In this work, we propose a novel probabilistic model to formalize how linguistic and perceptual inputs can work in concert to explain the observed word-context pairs in a text corpus. Our approach learns textual and visual representations jointly: latent visual factors couple together a skip-gram model for co-occurrence in linguistic data and a generative latent variable model for visual data. Extensive experimental studies validate the proposed model. Concretely, on the tasks of assessing pairwise word similarity and image/caption retrieval, our approach attains equally competitive or stronger results when compared to other state-of-the-art multimodal models.
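
To make the coupling described above concrete, the following is a minimal sketch, in PyTorch, of how shared latent visual factors might tie a skip-gram objective with negative sampling to a generative reconstruction of pre-computed visual features. It is an illustration of the general idea only, not the paper's actual formulation: the class and parameter names, the linear decoders, and the Gaussian-style (squared-error) visual likelihood are all assumptions.

```python
# Illustrative sketch only (assumed architecture, not the paper's exact model):
# a shared set of latent visual factors both biases a skip-gram objective and
# reconstructs pre-computed visual features (e.g. CNN activations), so that
# gradients from both modalities shape the same factors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointTextVisionSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim, latent_dim, visual_dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)         # target-word embeddings
        self.ctx_emb = nn.Embedding(vocab_size, embed_dim)          # context-word embeddings
        self.visual_factors = nn.Embedding(vocab_size, latent_dim)  # latent visual factors per word
        self.to_text = nn.Linear(latent_dim, embed_dim)             # couples visual factors to the text model
        self.to_visual = nn.Linear(latent_dim, visual_dim)          # decodes factors into visual-feature space

    def skipgram_loss(self, targets, contexts, negatives):
        # Skip-gram with negative sampling; visual factors bias the target representation.
        t = self.word_emb(targets) + self.to_text(self.visual_factors(targets))
        pos = F.logsigmoid((t * self.ctx_emb(contexts)).sum(-1))
        neg = F.logsigmoid(-(t.unsqueeze(1) * self.ctx_emb(negatives)).sum(-1)).sum(-1)
        return -(pos + neg).mean()

    def visual_loss(self, words, visual_feats):
        # Squared-error reconstruction of pre-computed visual features from the same
        # latent factors, so both modalities constrain them.
        recon = self.to_visual(self.visual_factors(words))
        return F.mse_loss(recon, visual_feats)

# Usage sketch: sum the two terms for words that have images.
model = JointTextVisionSketch(vocab_size=10000, embed_dim=300, latent_dim=64, visual_dim=4096)
targets = torch.randint(0, 10000, (32,))
contexts = torch.randint(0, 10000, (32,))
negatives = torch.randint(0, 10000, (32, 5))
visual_feats = torch.randn(32, 4096)
loss = model.skipgram_loss(targets, contexts, negatives) + model.visual_loss(targets, visual_feats)
loss.backward()
```

Because the same latent factors appear in both terms, gradients from the visual reconstruction shape the representations used to predict word contexts, which is the coupling the abstract refers to.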

Highlights

  • Continuous-valued vector representation of words has been one of the key components in neural architectures for natural language processing (Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014).

  • We develop a new model which jointly learns word embeddings from text and extracts latent visual information, from pre-computed visual features, that could supplement the linguistic embeddings in modeling the co-occurrence of words and their contexts in a corpus.

  • We propose PIXIE, a novel probabilistic model that joins textual and perceptual information to infer multimodal word embeddings.

Summary

Introduction

Continuous-valued vector representation of words has been one of the key components in neural architectures for natural language processing (Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014). The embeddings produced by such models do not necessarily reflect all inherent aspects of human semantic knowledge, such as the perceptual aspect (Feng and Lapata, 2010). This has motivated many researchers to explore different ways to infuse visual information, often represented in the form of pre-computed visual features, into word embeddings (Kiela and Bottou, 2014; Silberer et al., 2017; Collell et al., 2017; Lazaridou et al., 2015). In our model, the latent visual factors extracted from such features improve the modeling of word-context co-occurrences in text data. Another appealing property of our model is its natural ability to propagate perceptual information to the embeddings of words lacking visual features (e.g., abstract words) during learning. We show matching or stronger performance when compared to other state-of-the-art approaches for learning multimodal embeddings.
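
The propagation property mentioned above can be illustrated with a small, hedged sketch (the masking scheme below is an assumption, not the paper's exact training procedure): words without pre-computed visual features simply contribute no visual reconstruction term, yet their embeddings are still updated by the shared skip-gram term, which ties them to visually grounded words through co-occurrence.

```python
# Illustrative sketch (assumed, not the paper's exact training loop): mask the visual
# reconstruction term for words that lack pre-computed visual features. Their embeddings
# are still trained by the skip-gram term, so perceptual information reaches them
# indirectly through co-occurrence with visually grounded words.
import torch

def joint_loss(skipgram_loss, recon, visual_feats, has_image, visual_weight=1.0):
    # skipgram_loss: scalar text loss for the batch
    # recon, visual_feats: (batch, visual_dim) predicted and observed visual features
    # has_image: (batch,) boolean mask, True where the word has visual features
    per_word = ((recon - visual_feats) ** 2).mean(dim=-1)   # squared error per word
    mask = has_image.float()
    visual = (per_word * mask).sum() / mask.sum().clamp(min=1.0)
    return skipgram_loss + visual_weight * visual
```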

Setup and Background
Joint Visual and Text Modeling
Approximate Inference and Learning
Related Work
Experiments
Task 1
Main results
Qualitative analysis
Task 2
Results
Conclusion
