Abstract

In this article, we conduct an extensive quantitative error analysis of different multi-modal neural machine translation (MNMT) models which integrate visual features into different parts of both the encoder and the decoder. We investigate the scenario where models are trained on an in-domain training data set of parallel sentence pairs with images. We analyse two different types of MNMT models, which use global and local image features: the former encode an image globally, i.e. there is one feature vector representing an entire image, whereas the latter encode spatial information, i.e. there are multiple feature vectors, each encoding a different portion of the image. We conduct an error analysis of translations generated by different MNMT models as well as text-only baselines, studying how multi-modal models compare when translating both visual and non-visual terms. In general, we find that the additional multi-modal signals consistently improve translations, even more so when using simpler MNMT models that rely on global visual features. We also find that not only are translations of terms with a strong visual connotation improved, but almost all kinds of errors decrease when multi-modal models are used.
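
As an illustration of the difference between the two feature types, the following sketch extracts one global vector for the whole image and one vector per spatial location. It assumes PyTorch and torchvision with a pretrained VGG19 used purely as a feature extractor; variable names and dimensions are illustrative, not the exact setup of the models analysed in the article.

    import torch
    import torchvision.models as models

    # Pretrained CNN used only as a feature extractor (an assumption for this sketch).
    cnn = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

    image = torch.randn(1, 3, 224, 224)  # dummy stand-in for a preprocessed image

    with torch.no_grad():
        # Local (spatial) features: the last convolutional feature map,
        # i.e. one 512-dimensional vector per spatial location (7 x 7 = 49 here).
        conv_feats = cnn.features(image)                       # (1, 512, 7, 7)
        local_feats = conv_feats.flatten(2).transpose(1, 2)    # (1, 49, 512)

        # Global features: a single vector summarising the entire image,
        # here the activations of a fully connected layer (4096-dimensional).
        pooled = cnn.avgpool(conv_feats).flatten(1)             # (1, 25088)
        global_feat = cnn.classifier[:4](pooled)                # (1, 4096)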

Highlights

  • Neural machine translation (NMT) has recently been successfully tackled as a sequence-to-sequence learning problem (Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever et al. 2014).

  • This work aims to provide a comprehensive quantitative error analysis of translations generated with different variants of multi-modal NMT (MNMT) models, namely the MNMT models introduced in Calixto et al. (2017) and Calixto and Liu (2017).

  • We conducted an extensive error analysis of the translations generated by two baselines, a phrase-based statistical MT (PBSMT) model and a standard attention-based NMT model, and by MNMT models that incorporate images into state-of-the-art attention-based NMT: by using images as words in the source sentence, to initialise the encoder’s hidden state, as additional data in the initialisation of the decoder’s hidden state, and by means of an additional, independent visual attention mechanism; one of these strategies is sketched below.
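
A minimal sketch of one of these strategies, using a global image feature vector as additional data when initialising the decoder's hidden state, is shown below. PyTorch is assumed; the module, names, and dimensions are illustrative, not the exact published architecture.

    import torch
    import torch.nn as nn

    class ImageAwareDecoderInit(nn.Module):
        """Combine a source-sentence summary with a global image feature
        to produce the decoder's initial hidden state."""

        def __init__(self, src_dim: int, img_dim: int, hid_dim: int):
            super().__init__()
            self.src_proj = nn.Linear(src_dim, hid_dim)  # textual contribution
            self.img_proj = nn.Linear(img_dim, hid_dim)  # visual contribution

        def forward(self, src_summary, img_feat):
            # src_summary: (batch, src_dim), e.g. the mean of the encoder's hidden states
            # img_feat:    (batch, img_dim), e.g. a 4096-d global CNN feature
            return torch.tanh(self.src_proj(src_summary) + self.img_proj(img_feat))

    # Usage with dummy tensors: a batch of two sentences and their images.
    init = ImageAwareDecoderInit(src_dim=1000, img_dim=4096, hid_dim=500)
    h0 = init(torch.randn(2, 1000), torch.randn(2, 4096))  # (2, 500) initial state

The other strategies listed above follow the same pattern: the projected image vector can instead be prepended or appended to the source word embeddings, used to initialise the encoder, or attended over with a separate visual attention mechanism.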

Summary

Introduction

Neural machine translation (NMT) has recently been successfully tackled as a sequence-to-sequence (seq2seq) learning problem (Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever et al. 2014). In this problem, each training example consists of one source and one target variable-length sequence, and there is no prior information regarding the alignments between the two. Textual context alone, however, is not always sufficient to resolve ambiguities in the source sentence. To mention two rather trivial examples of ambiguity: “The beautiful jaguar is really fast” has an ambiguous noun phrase, and the textual context (“is really fast”) cannot really help disambiguate it; and the classical “The man on the hill saw the boy with a telescope” famously admits many different interpretations (Church and Patil 1982). In both examples, having an image illustrative of the sentence could be the additional signal that enables the model to arrive at the correct sentence interpretation and translation.
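
Concretely, under the seq2seq formulation used in the works cited above, the model estimates the probability of a target sentence y = (y_1, ..., y_T) given a source sentence x by factorising it over target positions:

    p(y | x) = ∏_{t=1}^{T} p(y_t | y_{<t}, x),

where each conditional is computed by the decoder from its current hidden state and, in attention-based models, a context vector computed over the encoder's hidden states.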

