Abstract

Caption translation aims to translate image annotations (captions for short). Recently, Multimodal Neural Machine Translation (MNMT) has been explored as the essential solution. Besides linguistic features in captions, MNMT allows visual (image) features to be used. The integration of multimodal features reinforces the semantic representation and considerably improves translation performance. However, MNMT suffers from the incongruence between visual and linguistic features. To overcome this problem, we propose to extend the MNMT architecture with a harmonization network, which harmonizes multimodal features (linguistic and visual features) by unidirectional modal space conversion. It enables multimodal translation to be carried out in a seemingly monomodal translation pipeline. We experiment on the gold Multi30k-16 and Multi30k-17 datasets. Experimental results show that, compared to the baseline, the proposed method yields improvements of up to 2.2% BLEU for translating English captions into German (En→De), 7.6% for English-to-French translation (En→Fr), and 1.5% for English-to-Czech (En→Cz). The harmonization network leads to performance competitive with the state of the art.
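
To make the abstract's pipeline more concrete, the sketch below illustrates one way a harmonization network could realize unidirectional modal space conversion: image features are projected into the linguistic feature space, so the downstream translation encoder sees a single, text-like sequence. This is a minimal sketch under assumptions; the class names, dimensions, and concatenation-based fusion are illustrative, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of unidirectional modal space conversion:
# visual features are mapped one-way into the linguistic space, after which the
# translation encoder operates as if it were monomodal.
import torch
import torch.nn as nn


class HarmonizationNetwork(nn.Module):
    """Hypothetical module: image-region features -> linguistic feature space."""

    def __init__(self, visual_dim: int = 2048, text_dim: int = 512):
        super().__init__()
        self.to_text_space = nn.Sequential(
            nn.Linear(visual_dim, text_dim),
            nn.Tanh(),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, n_regions, visual_dim) image-region features
        return self.to_text_space(visual_feats)  # (batch, n_regions, text_dim)


class SeeminglyMonomodalEncoder(nn.Module):
    """Standard Transformer encoder fed with harmonized (text-like) features."""

    def __init__(self, vocab_size: int = 32000, text_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.harmonize = HarmonizationNetwork(text_dim=text_dim)
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, src_tokens: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        text = self.embed(src_tokens)                 # (batch, src_len, text_dim)
        image_as_text = self.harmonize(visual_feats)  # (batch, n_regions, text_dim)
        # After conversion, both modalities live in one space, so the encoder can
        # process their concatenation as a single "monomodal" sequence.
        return self.encoder(torch.cat([text, image_as_text], dim=1))
```

The conversion is deliberately one-directional in this sketch: only the image side is moved, which keeps the text pathway identical to a standard NMT encoder and avoids disturbing the linguistic representations.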

Highlights

  • Multimodal Neural Machine Translation (MNMT) has been explored as the essential solution to caption translation

  • The integration of multimodal features reinforces the semantic representation and considerably improves translation performance

  • Multimodal Neural Machine Translation (MNMT) suffers from the incongruence between visual and linguistic features


Summary

Introduction

Caption translation is required to translate a source-language caption into the target language, where a caption refers to the sentence-level text annotation of an image. Visual features such as spatial relationships are obscure to a translation model, yet they are useful for translating a verb such as “pulling” in the caption. In this case, the incongruence of heterogeneous features results from the model's unawareness of the correspondence between the spatial relationship (“running horses” ahead of “sleigh”) and the linguistic semantics (“pulling”). We employ a captioning model to conduct harmonization; such models are specially trained to generate language conditioned on the visual features of images. During training, the captioning model learns to perceive the correspondence between visual and linguistic features, such as that between the spatial relationship of “running dogs ahead of a sled” in Figure 1 and the meaning of the verb “pulling”.
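
To make the role of the captioning model more concrete, the sketch below shows a conventional image-conditioned captioning objective: the decoder is trained to generate the caption token by token while conditioned only on image features, which is how it acquires the visual-linguistic correspondence mentioned above. This is an illustrative sketch, not the authors' recipe; the architecture (a GRU decoder), dimensions, and teacher-forcing setup are assumptions.

```python
# Hedged sketch of training a captioning model whose image-conditioned states
# could later serve harmonization. Every name and hyperparameter is illustrative.
import torch
import torch.nn as nn


class Captioner(nn.Module):
    def __init__(self, vocab_size: int = 32000, visual_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden)  # image features -> decoder space
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, visual_feats: torch.Tensor, caption_in: torch.Tensor) -> torch.Tensor:
        # Initialise the decoder with pooled image features so every generated word
        # is conditioned on the picture; this is where the correspondence between
        # visual and linguistic features is learned.
        h0 = self.visual_proj(visual_feats.mean(dim=1)).unsqueeze(0)  # (1, B, H)
        states, _ = self.rnn(self.embed(caption_in), h0)              # (B, T, H)
        return self.out(states)                                       # (B, T, V)


def train_step(model, optimizer, visual_feats, caption):
    # Teacher forcing: predict token t+1 from tokens <= t and the image.
    logits = model(visual_feats, caption[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), caption[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```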

Preliminary 1
Preliminary 2
Resource and Experimental Datasets
Training and Hyperparameter Settings
Performance on Ambiguous COCO
RELATED WORK
Findings
CONCLUSION