Abstract
This paper proposes a supervised visual attention mechanism for multimodal neural machine translation (MNMT), trained with constraints based on manual alignments between words in a sentence and their corresponding regions of an image. The proposed mechanism captures the relationship between a word and an image region more precisely than a conventional visual attention mechanism trained through MNMT in an unsupervised manner. Our experiments on English-German and German-English translation tasks using the Multi30k dataset, and on English-Japanese and Japanese-English translation tasks using the Flickr30k Entities JP dataset, show that a Transformer-based MNMT model can be improved by incorporating the proposed supervised visual attention mechanism, and that further improvements can be achieved by combining it with a supervised cross-lingual attention mechanism (up to +1.61 BLEU and +1.7 METEOR).
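The abstract describes supervising the visual attention head with manual word-region alignments. As a minimal PyTorch sketch of this general idea (not the authors' released implementation), the snippet below penalizes the cross-entropy between an attention distribution and a normalized gold alignment matrix; the tensor names attn and gold and the weighting hyperparameter lambda_attn are assumptions for illustration. The same loss form can supervise cross-lingual attention by treating source tokens in place of image regions.

    import torch

    def supervised_attention_loss(attn, gold, eps=1e-8):
        """Cross-entropy between model attention and normalized gold alignments.

        attn: (batch, tgt_len, n_regions) attention weights (rows sum to 1)
        gold: (batch, tgt_len, n_regions) binary manual word-region alignments
        """
        gold = gold.float()
        row_sum = gold.sum(dim=-1, keepdim=True)
        gold_dist = gold / row_sum.clamp(min=1.0)        # normalize rows with alignments
        ce = -(gold_dist * (attn + eps).log()).sum(-1)   # per-word cross-entropy
        mask = (row_sum.squeeze(-1) > 0).float()         # skip words with no aligned region
        return (ce * mask).sum() / mask.sum().clamp(min=1.0)

    # Combined objective (lambda_attn is a hypothetical weighting hyperparameter):
    # loss = translation_loss + lambda_attn * supervised_attention_loss(attn, gold)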
Highlights
The mainstream Neural Machine Translation (NMT) model, widely used since the early days of the field, is the Recurrent Neural Network (RNN)-based NMT with an attention mechanism (Luong et al., 2015)
This paper proposes a supervised visual attention mechanism trained with constraints based on manual alignments between words in a sentence and their corresponding image regions to improve multimodal neural machine translation (MNMT)
We introduce the supervised cross-lingual attention mechanism described in Section 2.2 into our MNMT model to further improve translation performance
Summary
The Neural Machine Translation (NMT) model widely used since the early days of the field is the Recurrent Neural Network (RNN)-based NMT with an attention mechanism (Luong et al., 2015). This model achieves higher translation accuracy than conventional RNN-based NMT by using a cross-lingual attention mechanism that captures the relationship between words in the source and target language sentences. We experimented with English-German and German-English translation using the Multi30k dataset (Elliott et al., 2016) and with English-Japanese and Japanese-English translation using the Flickr30k Entities JP dataset (Nakayama et al., 2020). These experiments show that the proposed supervised visual attention mechanism improves a Transformer-based MNMT model's performance (i.e., its BLEU and METEOR scores)