Abstract

Machine translation refers to a fully automated process that translates a user's input text into a target language. To improve translation accuracy, studies usually exploit not only the input text itself but also background knowledge related to the text, such as visual information or prior knowledge. In this paper, we propose a multimodal neural machine translation system that uses both the text and its associated images to translate Korean image captions into English. The experimental data consist of unlabeled images accompanied only by bilingual captions. To train the system with a supervised learning approach, we propose a weak-labeling method that selects a keyword from each image caption using feature selection methods; the selected keyword roughly determines the image's label. We also introduce an improved feature selection method based on sentence clustering, which selects keywords that reflect the characteristics of the image captions more accurately. We found that our multimodal system achieves improved performance over a text-only neural machine translation baseline. Furthermore, the additional images help address the problem of under-translation, in which some words in a source sentence are mistranslated or not translated at all.
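To make the weak-labeling idea concrete, the sketch below assigns each caption a pseudo image label by picking its highest-scoring token. TF-IDF is used here purely as a stand-in scoring function; the abstract does not specify which feature selection methods the paper uses, and the cluster-based refinement is not reproduced. The function name and example captions are hypothetical.

```python
import math
from collections import Counter

def weak_label_keywords(captions):
    """Assign each caption a pseudo-label: its highest-scoring token.

    Hypothetical illustration: TF-IDF stands in for the paper's
    (unspecified) feature selection methods. In practice, stop words
    would typically be filtered before scoring.
    """
    tokenized = [caption.lower().split() for caption in captions]
    n_docs = len(tokenized)

    # Document frequency: the number of captions containing each token.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    labels = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Score each token by TF-IDF within its own caption.
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # The top-scoring token serves as the caption's weak image label.
        labels.append(max(scores, key=scores.get))
    return labels

captions = [
    "a dog runs across the beach",
    "a cat sleeps on the sofa",
    "a dog plays with a ball",
]
print(weak_label_keywords(captions))  # one pseudo-label per caption
```

Tokens shared across many captions (e.g., articles) score near zero, so the chosen keyword tends to be a caption-specific content word, which is what makes it usable as a rough image label.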
