As an important extension of conventional text-only neural machine translation (NMT), multi-modal neural machine translation (MNMT) aims to translate input source sentences paired with images into the target language. Although many MNMT models have been proposed to perform multi-modal semantic fusion, they do not consider fine-grained semantic correspondences between semantic units of different modalities (i.e., words and visual objects), which can be exploited to refine multi-modal representation learning via fine-grained semantic interactions. To address this issue, we propose a graph-based multi-modal fusion encoder for NMT. Concretely, we first employ a unified multi-modal graph to represent the input sentence and image, in which the multi-modal semantic units are considered as the nodes of the graph, connected by two kinds of edges encoding different semantic relationships. Then, we stack multiple graph-based multi-modal fusion layers that iteratively conduct intra- and inter-modal interactions to learn node representations. Finally, via an attention mechanism, we induce a multi-modal context from the top node representations for the decoder. In particular, we introduce a progressive contrastive learning strategy based on the multi-modal graph to refine the training of our proposed model, where hard negative samples are introduced gradually. To evaluate our model, we conduct experiments on commonly used datasets. Experimental results and analysis show that our MNMT model obtains significant improvements over competitive baselines, achieving state-of-the-art performance on the Multi30K dataset.
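The stacking of fusion layers with alternating intra- and inter-modal interactions can be illustrated with a minimal sketch. This is not the paper's implementation: the attention here is plain single-head scaled dot-product attention with residual connections, and all names (`fusion_layer`, `masked_attention`), the node dimensions, and the use of edge-type masks to restrict attention are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(nodes, mask):
    # Scaled dot-product attention restricted to the edges allowed by `mask`
    # (a simplification of the paper's graph-based interaction).
    d = nodes.shape[-1]
    scores = nodes @ nodes.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block disallowed edges
    return softmax(scores, axis=-1) @ nodes

def fusion_layer(nodes, intra_mask, inter_mask):
    # One hypothetical fusion layer: an intra-modal interaction followed by
    # an inter-modal interaction, each with a residual connection.
    nodes = nodes + masked_attention(nodes, intra_mask)
    nodes = nodes + masked_attention(nodes, inter_mask)
    return nodes

# Toy multi-modal graph: 3 word nodes and 2 visual-object nodes, dim 4.
rng = np.random.default_rng(0)
nodes = rng.standard_normal((5, 4))
modality = np.array([0, 0, 0, 1, 1])            # 0 = text, 1 = vision
intra = modality[:, None] == modality[None, :]  # intra-modal edges
inter = ~intra | np.eye(5, dtype=bool)          # cross-modal edges + self-loops

out = nodes
for _ in range(3):  # stack multiple fusion layers
    out = fusion_layer(out, intra, inter)
print(out.shape)  # (5, 4)
```

The two boolean masks play the role of the graph's two edge types: the first pass lets each node attend only within its own modality, the second only across modalities, so stacking layers iteratively mixes textual and visual information as the abstract describes.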