Unsupervised Translation Quality Estimation for Digital Entertainment Content Subtitles

Prabhakar Gupta, Mayank Sharma

doi:10.1142/s1793351x20500026

Abstract

We demonstrate the potential of aligned bilingual word embeddings for developing an unsupervised method to evaluate machine translations without the need for a parallel translation corpus or a reference corpus. We explain different aspects of digital entertainment content subtitles. We share our experimental results for four language pairs (English to French, German, Portuguese, and Spanish) and present findings on the shortcomings of Neural Machine Translation for subtitles. We propose several improvements over the system designed by Gupta et al. [P. Gupta, S. Shekhawat and K. Kumar, Unsupervised quality estimation without reference corpus for subtitle machine translation using word embeddings, IEEE 13th Int. Conf. Semantic Computing, 2019, pp. 32–38.] by incorporating a custom embedding model curated for subtitles, compound word splits, and punctuation inclusion. We show a massive run-time improvement of the order of [Formula: see text] relative to their system by considering three types of edits, removing the Proximity Intensity Index (PII), and changing the post-edit score calculation.

Full Text