Abstract

Multimodal machine translation, which incorporates the visual information of an accompanying image, has become a research hotspot in recent years. Most existing works project the image features into the text semantic space and merge them into the model in various ways. In practice, however, different source words may attend to different visual information. We therefore propose a multimodal neural machine translation (MNMT) model that integrates each word and the visual information of the image independently: the word itself and the image regions most similar to it are fused into the word's textual semantics, enhancing both the textual representation and the word-specific visual information. The fused representations are then used to compute the context vector of the decoder attention. We conduct experiments on the original English-German sentence pairs of the multimodal machine translation dataset Multi30k and on manually annotated Indonesian-Chinese sentence pairs. Compared with existing RNN-based MNMT models, our model achieves better performance, demonstrating its effectiveness.
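
The following is a minimal sketch, in PyTorch style, of the per-word visual fusion described above: each source word attends over image region features, and the resulting word-specific visual context is fused with the word's own representation before the decoder computes its attention context vector. The layer names, dimensions, and the tanh-gated fusion are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class WordVisualFusion(nn.Module):
    """Sketch: fuse each source word with the image regions most similar to it."""
    def __init__(self, d_text: int, d_img: int, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_text, d_model)   # project word states to queries
        self.k = nn.Linear(d_img, d_model)    # project image regions to keys
        self.v = nn.Linear(d_img, d_model)    # project image regions to values
        self.fuse = nn.Linear(d_text + d_model, d_model)  # hypothetical fusion layer

    def forward(self, words, regions):
        # words:   (batch, src_len, d_text)   encoder hidden states per source word
        # regions: (batch, n_regions, d_img)  image region features (e.g. a CNN grid)
        q = self.q(words)                                   # (B, L, d_model)
        k = self.k(regions)                                 # (B, R, d_model)
        v = self.v(regions)                                 # (B, R, d_model)
        scores = torch.bmm(q, k.transpose(1, 2))            # word-to-region similarity
        attn = torch.softmax(scores / q.size(-1) ** 0.5, dim=-1)
        visual = torch.bmm(attn, v)                         # (B, L, d_model) per-word visual info
        # fuse each word's own representation with its word-specific visual context
        return torch.tanh(self.fuse(torch.cat([words, visual], dim=-1)))

# Usage sketch: the fused states would stand in for plain encoder states
# when the decoder attention computes its context vector at each target step.
fusion = WordVisualFusion(d_text=512, d_img=2048, d_model=512)
words = torch.randn(2, 10, 512)      # dummy encoder states
regions = torch.randn(2, 49, 2048)   # dummy 7x7 CNN grid features
fused = fusion(words, regions)       # (2, 10, 512)
```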
