Abstract
This paper proposes a supervised visual attention mechanism for multimodal neural machine translation (MNMT), trained with constraints based on manual alignments between words in a sentence and their corresponding regions of an image. The proposed mechanism captures the relationship between a word and an image region more precisely than a conventional visual attention mechanism trained through MNMT in an unsupervised manner. Our experiments on English-German and German-English translation tasks using the Multi30k dataset, and on English-Japanese and Japanese-English translation tasks using the Flickr30k Entities JP dataset, show that a Transformer-based MNMT model can be improved by incorporating the proposed supervised visual attention mechanism, and that further improvements can be achieved by combining it with a supervised cross-lingual attention mechanism (up to +1.61 BLEU and +1.7 METEOR).
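The abstract describes supervising the visual attention head with manual word-region alignments. As a minimal PyTorch sketch of this general idea (not the authors' released implementation), the snippet below penalizes the cross-entropy between an attention distribution and a normalized gold alignment matrix; the tensor names attn and gold and the weighting hyperparameter lambda_attn are assumptions for illustration. The same loss form can supervise cross-lingual attention by treating source tokens in place of image regions.

    import torch

    def supervised_attention_loss(attn, gold, eps=1e-8):
        """Cross-entropy between model attention and normalized gold alignments.

        attn: (batch, tgt_len, n_regions) attention weights (rows sum to 1)
        gold: (batch, tgt_len, n_regions) binary manual word-region alignments
        """
        gold = gold.float()
        row_sum = gold.sum(dim=-1, keepdim=True)
        gold_dist = gold / row_sum.clamp(min=1.0)        # normalize rows with alignments
        ce = -(gold_dist * (attn + eps).log()).sum(-1)   # per-word cross-entropy
        mask = (row_sum.squeeze(-1) > 0).float()         # skip words with no aligned region
        return (ce * mask).sum() / mask.sum().clamp(min=1.0)

    # Combined objective (lambda_attn is a hypothetical weighting hyperparameter):
    # loss = translation_loss + lambda_attn * supervised_attention_loss(attn, gold)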
Highlights
The mainstream Neural Machine Translation (NMT) model, widely used since the early days of the field, is the Recurrent Neural Network (RNN)-based NMT with an attention mechanism (Luong et al., 2015)
This paper proposes a supervised visual attention mechanism trained with constraints based on manual alignments between words in a sentence and their corresponding image regions to improve multimodal neural machine translation (MNMT)
We introduce the supervised cross-lingual attention mechanism described in Section 2.2 into our MNMT model to further improve translation performance
Summary
The Neural Machine Translation (NMT) model widely used since the early days of the field is the Recurrent Neural Network (RNN)-based NMT with an attention mechanism (Luong et al., 2015). This model achieves higher translation accuracy than conventional RNN-based NMT by using a cross-lingual attention mechanism that captures the relationship between words in the source and target language sentences. We experimented with English-German and German-English translation using the Multi30k dataset (Elliott et al., 2016) and with English-Japanese and Japanese-English translation using the Flickr30k Entities JP dataset (Nakayama et al., 2020). These experiments show that the proposed supervised visual attention mechanism improves a Transformer-based MNMT model's performance (i.e., its BLEU and METEOR scores)