English to Hindi Multi-modal Neural Machine Translation and Hindi Image Captioning

Sahinur Rahman Laskar, Rohit Pratap Singh, Partha Pakray, Sivaji Bandyopadhyay

doi:10.18653/v1/d19-5205

Abstract

With the widespread use of Machine Translation (MT) techniques, attempts are being made to minimize the communication gap among people from diverse linguistic backgrounds. We participated in the Workshop on Asian Translation 2019 (WAT2019) multi-modal translation task. There are three submission tracks, namely multi-modal translation, Hindi-only image captioning, and text-only translation, for English to Hindi translation. The main challenge is to provide a precise MT output. The multi-modal concept incorporates textual and visual features in the translation task. In this work, the multi-modal translation track relies on a pre-trained convolutional neural network (CNN), the 19-layer Visual Geometry Group network (VGG19), to extract image features and on an attention-based Neural Machine Translation (NMT) system for translation. A merge model of a recurrent neural network (RNN) and a CNN is used for Hindi-only image captioning. The text-only translation track is based on the transformer model of the NMT system. The official results of the WAT2019 translation task show that our multi-modal NMT system achieved a Bilingual Evaluation Understudy (BLEU) score of 20.37, a Rank-based Intuitive Bilingual Evaluation Score (RIBES) of 0.642838, and an Adequacy-Fluency Metrics (AMFM) score of 0.668260 on the challenge test data, and a BLEU score of 40.55, a RIBES of 0.760080, and an AMFM score of 0.770860 on the evaluation test data in English to Hindi multi-modal translation.

Highlights

Multi-modal translation is an emerging task in the Machine Translation (MT) community, in which visual features of an image are combined with textual features of parallel
In the official evaluation of the Workshop on Asian Translation 2019 (WAT2019) translation task, our multi-modal NMT system achieved Bilingual Evaluation Understudy (BLEU) scores of 20.37 on the challenge test data and 40.55 on the evaluation test data

Summary

Introduction

Multi-modal translation is an emerging task in the MT community, in which visual features of an image are combined with textual features of parallel. We participated in the WAT2019 multi-modal translation task with NMT systems in three tracks: multi-modal translation, Hindi-only image captioning, and text-only translation. The multi-modal translation track relies on a pre-trained convolutional neural network (CNN), the 19-layer Visual Geometry Group network (VGG19), to extract image features and on an attention-based Neural Machine Translation (NMT) system for translation.
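To illustrate how an attention mechanism can combine a decoder state with image features, the following minimal sketch computes dot-product attention over a handful of region feature vectors. It is an illustration only, not the system submitted to WAT2019: the dot-product scoring function, the names `softmax` and `attend`, and the toy two-dimensional features are all assumptions made for this example, and real VGG19 features would have far higher dimensionality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, region_features):
    """Return (weights, context) for dot-product attention.

    decoder_state: list of d floats (hypothetical decoder hidden state).
    region_features: list of regions, each a list of d floats
    (toy stand-ins for CNN image features; an assumption of this sketch).
    """
    # Score each region by its dot product with the decoder state.
    scores = [
        sum(q * k for q, k in zip(decoder_state, region))
        for region in region_features
    ]
    # Normalize the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of region features.
    dim = len(decoder_state)
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(state, regions)
```

In this toy case the first and third regions align equally well with the decoder state, so they receive equal weight and dominate the context vector, while the second region contributes little.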

Results

Conclusion