Abstract

Existing research on multimodal machine translation (MMT) has typically enhanced bilingual translation by introducing additional aligned visual information. However, the requirement that multimodal datasets provide images aligned with both the source text and the target text places a significant constraint on the development of MMT, and this limitation is compounded at inference time, when aligned images are not directly available in a conventional neural machine translation (NMT) setup. We therefore propose an MMT framework, the DSKP-MMT model, which supports translation in the absence of images through knowledge distillation and feature refinement. The model first generates multimodal features from the source text alone; these features are then purified by a multimodal feature generator and a knowledge distillation module, and further refined through image feature enhancement. Finally, the resulting image-text fusion features serve as input to a Transformer-based translation model. On the Multi30K dataset, the DSKP-MMT model achieves a BLEU score of 40.42 and a METEOR score of 58.15, demonstrating its ability to improve translation quality and to facilitate communication.
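Since only the abstract is available, the following is a minimal, illustrative sketch of the pipeline it describes: a multimodal feature generator predicts pseudo-visual features from the source text alone, a knowledge distillation loss pulls them toward real image features during training (when aligned images exist), and the fused text-visual representation feeds a standard Transformer translator. The module names, dimensions, mean-pooling step, and concatenation-based fusion below are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the DSKP-MMT idea (assumed design, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalFeatureGenerator(nn.Module):
    """Predicts pseudo-visual features from source-text embeddings (assumed design)."""

    def __init__(self, d_model: int = 512, d_visual: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_visual)
        )

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, src_len, d_model) -> pooled pseudo-image feature
        pooled = text_feats.mean(dim=1)
        return self.proj(pooled)  # (batch, d_visual)


class DSKPMMTSketch(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, d_model: int = 512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.generator = MultimodalFeatureGenerator(d_model, d_model)
        # Simple concat-then-project fusion of text and generated visual features.
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=4,
            num_decoder_layers=4, batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, real_image_feats=None):
        text = self.src_emb(src)                      # (B, S, d_model)
        pseudo_visual = self.generator(text)          # (B, d_model)
        # Knowledge distillation term: only during training, when aligned image
        # features (assumed to be d_model-dimensional) are available.
        kd_loss = (
            F.mse_loss(pseudo_visual, real_image_feats)
            if real_image_feats is not None else torch.tensor(0.0)
        )
        # Broadcast the generated visual feature to every source position and fuse.
        visual = pseudo_visual.unsqueeze(1).expand(-1, text.size(1), -1)
        fused = self.fuse(torch.cat([text, visual], dim=-1))
        dec = self.transformer(fused, self.tgt_emb(tgt))
        return self.out(dec), kd_loss
```

At inference time the model would be called with `real_image_feats=None`, so translation relies only on the text-generated pseudo-visual features, which matches the image-free setting the abstract targets.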
