Abstract

Multimodal machine translation uses images related to source language sentences as additional inputs to improve translation quality. Previous multimodal Neural Machine Translation (NMT) models, which incorporate the visual features of each image region into an encoder for source language sentences or into an attention mechanism between an encoder and a decoder, cannot capture the relations between the visual features of different image regions. This paper proposes a new multimodal NMT model, which encodes an input image using a Convolutional Neural Network (CNN) and a Transformer encoder. In particular, the proposed image encoder first extracts visual features from each image region using a CNN, and then encodes the input image on the basis of the extracted visual features using a Transformer encoder, where the relations between the visual features of different image regions are captured by the self-attention mechanism of the Transformer encoder. Experiments on the English-German translation task using the Multi30k data set show that the proposed model achieves a 0.96 BLEU point improvement over a baseline Transformer NMT model without image inputs and a 0.47 BLEU point improvement over a baseline multimodal Transformer NMT model without a Transformer encoder for images.
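The abstract describes a two-stage image encoder: a CNN that extracts per-region visual features, followed by a Transformer encoder whose self-attention relates those regions to one another. The sketch below is a minimal, hypothetical PyTorch illustration of that structure, not the authors' implementation; the choice of ResNet-50 as the CNN backbone and all dimensions, layer counts, and names are assumptions.

```python
# Minimal sketch (assumptions noted above, not the paper's code): extract
# region features with a CNN, then relate them with Transformer self-attention.
import torch
import torch.nn as nn
import torchvision.models as models


class ImageTransformerEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        # CNN backbone (ResNet-50 is an assumption); drop the pooling and
        # classification head to keep the spatial grid of region features.
        resnet = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 2048, H', W')
        self.proj = nn.Linear(2048, d_model)  # project region features to model size
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images):  # images: (B, 3, H, W)
        feats = self.cnn(images)                    # (B, 2048, H', W')
        regions = feats.flatten(2).transpose(1, 2)  # (B, H'*W', 2048): one vector per region
        regions = self.proj(regions)                # (B, H'*W', d_model)
        # Self-attention lets every region attend to every other region,
        # capturing relations between visual features of different regions.
        return self.transformer(regions)            # (B, H'*W', d_model)


if __name__ == "__main__":
    enc = ImageTransformerEncoder()
    out = enc(torch.randn(2, 3, 224, 224))
    print(out.shape)  # (2, 49, 512): 7x7 regions for 224x224 inputs
```

The resulting sequence of region encodings could then be attended to by the translation decoder, analogously to how the source-sentence encodings are used.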
