Abstract
Traditional multi-modal machine translation mainly introduces static images as an additional modality to improve translation quality. Existing approaches combine a variety of data- and feature-level techniques to push translation results toward their upper bound, and some even depend on the sensitivity of sample-distance algorithms to the data. At the same time, multi-modal MT suffers from problems such as a lack of semantic interaction within the attention mechanism over the same corpus, or over-encoding of text-image information that is irrelevant to the corpus, which introduces excessive noise. To address these problems, this article proposes a new input port that adds visual image processing to the decoder. The core idea is to combine visual image information with the traditional attention mechanism at each decoding time step: a dynamic router extracts the relevant visual features, integrates these multi-modal visual features into the decoder, and the decoder predicts the target word with the help of the visual information. Experiments on the Multi30K dataset for English, French, and Czech translation tasks demonstrate the benefit of letting the decoder extract features from visual images.
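To make the per-step fusion concrete, below is a minimal PyTorch-style sketch of a decoder step that attends over both text states and image region features before predicting the target word. It is not the authors' implementation: a simple gated cross-attention stands in for the paper's dynamic router, and all names (`VisualGatedDecoderStep`, `hidden_dim`, the tensor shapes) are hypothetical illustrations.

```python
import torch
import torch.nn as nn


class VisualGatedDecoderStep(nn.Module):
    """One decoding step that attends over text states and visual regions,
    then gates the visual context before predicting the next target word.
    Hypothetical sketch: the gate stands in for the paper's dynamic router."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # The gate controls how much visual context enters the decoder state,
        # one way to suppress image information irrelevant to the corpus.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, dec_state, text_memory, visual_memory):
        # dec_state: (B, 1, H) current decoder hidden state
        # text_memory: (B, S, H) encoder states; visual_memory: (B, R, H) region features
        text_ctx, _ = self.text_attn(dec_state, text_memory, text_memory)
        vis_ctx, _ = self.visual_attn(dec_state, visual_memory, visual_memory)
        g = self.gate(torch.cat([text_ctx, vis_ctx], dim=-1))  # (B, 1, H), values in [0, 1]
        fused = text_ctx + g * vis_ctx  # keep the visual signal only where it helps
        return self.out(fused)  # logits over the target vocabulary


# Usage: one step with batch=2, 20 source tokens, 36 image regions, H=512.
step = VisualGatedDecoderStep(hidden_dim=512, vocab_size=10000)
logits = step(torch.randn(2, 1, 512), torch.randn(2, 20, 512), torch.randn(2, 36, 512))
print(logits.shape)  # torch.Size([2, 1, 10000])
```

Running this fusion at every time step, rather than encoding the image once on the encoder side, is what lets the visual contribution vary with the word currently being predicted.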