Abstract
In recent years, neural machine translation, especially multimodal machine translation, has developed rapidly and has been widely applied to natural language processing tasks such as event detection and sentiment classification. Most existing multimodal neural machine translation models are built on an attention-based encoder-decoder framework that further integrates spatial visual features. However, owing to the pervasive scarcity of corpora and the limited semantic interaction between modalities, translation quality is difficult to guarantee. This paper therefore proposes a multimodal machine translation model that integrates external linguistic knowledge. Specifically, on the encoder side, a pre-trained BERT model serves as an additional encoder alongside the original text encoder and the image encoder; with the three encoders working together, better source-side text and image representations are produced. The decoder then generates the translation from these source-side image and text representations. In summary, this paper studies visual-text semantic interaction on both the encoder side and the decoder side, and further improves translation quality by introducing external linguistic knowledge. We compared the multimodal neural machine translation model with pre-trained BERT against other baseline models on English-German translation tasks on the Multi30k dataset. The results show that the model significantly improves the quality of multimodal neural machine translation, which also verifies the importance of integrating external knowledge and of visual-text semantic interaction.
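The following PyTorch sketch illustrates one way the three-encoder design described above could be wired together: a frozen pre-trained BERT encoder, a trainable source-text Transformer encoder, and a projection over pre-extracted spatial image features, whose outputs are fused and attended to by a Transformer decoder. This is a minimal sketch under our own assumptions (the checkpoint name, dimensions, and gated-fusion scheme are illustrative, not taken from the paper), not the authors' implementation.

# Minimal sketch (not the authors' code) of a three-encoder multimodal NMT model:
# frozen pre-trained BERT + trainable text encoder + image-feature encoder,
# fused into a joint memory that a Transformer decoder attends over.
# Checkpoint, dimensions, and the gated fusion are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class MultimodalNMT(nn.Module):
    def __init__(self, tgt_vocab_size, d_model=768, img_feat_dim=2048,
                 n_heads=8, n_layers=4):
        super().__init__()
        # Encoder 1: frozen pre-trained BERT supplies external linguistic knowledge.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False
        # Encoder 2: trainable Transformer encoder over the source sentence.
        self.src_embed = nn.Embedding(self.bert.config.vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Encoder 3: projects pre-extracted spatial image features (e.g. CNN grid
        # features) into the model dimension.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        # Gate mixing BERT states with the trainable text-encoder states.
        self.gate = nn.Linear(2 * d_model, d_model)
        # Decoder attends over the fused visual-text memory to generate the target.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.out = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_ids, src_mask, img_feats, tgt_ids):
        # (Positional encodings and causal/padding masks omitted for brevity.)
        bert_states = self.bert(input_ids=src_ids,
                                attention_mask=src_mask).last_hidden_state
        text_states = self.text_encoder(self.src_embed(src_ids))
        # Gated fusion of the two textual representations.
        g = torch.sigmoid(self.gate(torch.cat([bert_states, text_states], dim=-1)))
        fused_text = g * bert_states + (1 - g) * text_states
        img_states = self.img_proj(img_feats)          # (batch, regions, d_model)
        memory = torch.cat([fused_text, img_states], dim=1)  # joint visual-text memory
        dec_out = self.decoder(self.tgt_embed(tgt_ids), memory)
        return self.out(dec_out)                       # logits over target vocabulary

Gated fusion is only one plausible way to combine the BERT states with the trainable text-encoder states; simple concatenation or cross-attention between the two encoders would fit the same high-level description.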
Highlights
The real world that human beings live in is a space where text, sound, image, and video coexist
Since Kalchbrenner et al. proposed the concept of neural machine translation in 2013, it has quickly achieved results comparable to, or even better than, traditional statistical machine translation, and has gradually become a research hotspot
ii) Experiments were performed on multiple language pairs from the Multi30k dataset, and the results show that the proposed model significantly improves the quality of multimodal neural machine translation
Summary
The real world that human beings live in is a space where text, sound, image, and video coexist. A large number of pre-training techniques and models, such as ELMo (Peters et al., 2018), GPT/GPT-2 (Radford et al., 2018), BERT (Devlin et al., 2019), the cross-lingual language model XLM (Lample & Conneau, 2019), XLNet (Yang et al., 2019b), and RoBERTa (Liu et al., 2019), have repeatedly set new performance records in their respective fields, and pre-training technology has attracted widespread attention from the machine learning and natural language processing communities. These models are pre-trained on large amounts of unlabeled data to better learn representations of the model input. Multimodal neural machine translation aims to use source-language sentences and the corresponding visual information simultaneously to obtain high-quality target-language translations. In this process, the input sentence and image are first encoded, and the decoder then generates the target-language sentence.
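As a concrete illustration of the representation-learning step mentioned above, the short sketch below obtains contextual source-sentence representations from a pre-trained BERT via the Hugging Face transformers library; the checkpoint name and the example sentence are illustrative assumptions, not details taken from the paper.

# Illustrative sketch: extracting contextual source-sentence representations
# from a pre-trained BERT (checkpoint and sentence are assumptions).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

sentence = "a man in an orange hat looks at something"  # illustrative caption-style input
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
# One contextual vector per subword token; in the model above these states
# would be fused with the trainable text encoder and the image features.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])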