Abstract

Video is an emerging modality among the many used for multimodal machine translation. Multimodal machine translation draws on multiple modalities to improve translation from the source language into the target language. However, currently available multimodal datasets cover only a few well-studied languages. In this paper, we propose a video-guided multimodal machine translation (VMMT) model under a low-resource setting by building a synthetic multimodal dataset for the English-Hindi language pair, the first of its kind for this pair. The VMMT system employs spatio-temporal video context as an additional input modality alongside the source text. The spatio-temporal video context is extracted using a pre-trained 3D convolutional neural network. We report how the VMMT systems outperform a text-only neural machine translation (NMT) system under automatic evaluation metrics and human evaluation on two test sets: one in-domain and one out-of-domain. Our results indicate that using video context as an additional input modality helps the MT system resolve common MT challenges, such as rare words and ambiguity, in both English→Hindi and Hindi→English translation. Our experiments show significant improvements of up to +4.2 BLEU and +0.07 chrF in English→Hindi and +5.4 BLEU and +0.07 chrF in Hindi→English with our VMMT system over the unimodal NMT system. Our findings highlight the potential of visual cues as an additional modality for improving machine translation, especially in low-resource settings, and emphasize the importance of synthetic multimodal datasets in addressing the scarcity of diverse data for less-studied language pairs.
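
For illustration, the sketch below shows one way the spatio-temporal video context could be extracted with a pre-trained 3D convolutional neural network. The abstract does not specify the backbone, clip length, or feature dimensionality, so the use of torchvision's r3d_18 and the dummy clip shape here are assumptions, not the authors' exact pipeline.

# Minimal sketch (assumed setup, not the paper's exact pipeline): extracting
# spatio-temporal video context with a pre-trained 3D CNN. The backbone
# (torchvision's r3d_18), clip length, and resolution are illustrative choices.
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

model = r3d_18(weights=R3D_18_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # drop the classification head, keep pooled features
model.eval()

# Dummy clip with shape (batch, channels, frames, height, width); real frames
# would be resized and normalized as the pre-trained weights expect.
clip = torch.rand(1, 3, 16, 112, 112)

with torch.no_grad():
    video_context = model(clip)  # shape (1, 512): pooled spatio-temporal features

# This feature vector would then be fed to the translation model as an
# additional input modality alongside the source-text embeddings.
print(video_context.shape)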
