Abstract

Multimodal Machine Translation (MMT) aims to enhance translation quality by incorporating information from other modalities (usually images). However, dominant MMT models do not account for the fact that visual features not only provide supplementary information but also introduce considerable noise. In this paper, we propose a visual feature filter to address this issue. Specifically, we adopt a soft-lookup function to select the visual features relevant to the text and then concatenate these features, as pseudo-words, with the text representation. In addition, our model performs two-pass decoding. The second pass amounts to polishing, which can identify errors in the draft translation: polishing expands the view available when decoding each target token and thus provides more contextual information. Moreover, since most words in a draft translation can be copied into the final translation, we further equip our model with a copying mechanism to preserve the words that do not need to be corrected. MMT has so far achieved success mainly in mainstream languages. To promote the development of MMT in low-resource languages such as Mongolian, we apply our model to the Mongolian→Chinese translation task and extend the Multi30k dataset with synthetic Mongolian and Chinese descriptions. Experiments on these synthetic Mongolian and Chinese datasets demonstrate that our model brings significant improvements.
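The sketch below illustrates one plausible reading of the soft-lookup visual feature filter described above: each source-text token attends over image region features, down-weighting irrelevant (noisy) regions, and the filtered visual features are appended to the text representation as pseudo-words. This is an illustrative assumption, not the paper's exact implementation; all layer names, dimensions, and the scaled dot-product attention choice are ours.

```python
import torch
import torch.nn as nn


class VisualFeatureFilter(nn.Module):
    """Soft-lookup filter (illustrative sketch, not the paper's code):
    selects visual features relevant to the text and appends them to the
    text representation as pseudo-word positions."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)  # projects text tokens to queries
        self.key = nn.Linear(d_model, d_model)    # projects visual regions to keys
        self.value = nn.Linear(d_model, d_model)  # projects visual regions to values

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   (batch, len_text, d_model) -- encoded source sentence
        # visual: (batch, len_vis,  d_model) -- projected image region features
        q = self.query(text)
        k = self.key(visual)
        v = self.value(visual)

        # Soft lookup: each text token attends over visual regions, so
        # regions irrelevant to the text receive low weight (noise filtering).
        scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)
        weights = torch.softmax(scores, dim=-1)      # (batch, len_text, len_vis)
        selected_visual = torch.matmul(weights, v)   # text-relevant visual features

        # Concatenate the filtered visual features with the text
        # representation as extra "pseudo-word" positions.
        return torch.cat([text, selected_visual], dim=1)


if __name__ == "__main__":
    # Minimal usage example with random tensors.
    filt = VisualFeatureFilter(d_model=512)
    text = torch.randn(2, 20, 512)    # 20 source tokens
    visual = torch.randn(2, 49, 512)  # 49 image region features
    fused = filt(text, visual)
    print(fused.shape)                # torch.Size([2, 40, 512])
```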
