Abstract

Machine Translation (MT) is an important application of natural language processing that automatically converts text in a source language into a target language, typically learned from a parallel corpus. Researchers in academia and industry are actively designing high-performance translation systems. MT has undergone a major paradigm shift: neural network-based systems, i.e., Neural Machine Translation (NMT), have largely replaced Statistical Machine Translation (SMT). This paper focuses on two tasks. First, we design an NMT model and evaluate its performance with human and automatic metrics on the low-resource Bengali-to-English language pair. Second, we assess two popular online translators, Google Translate and Bing, using the most widely used automatic evaluation metrics: Bilingual Evaluation Understudy (BLEU) and Word Error Rate (WER). BLEU's evaluation is based primarily on n-gram matching, so its scores are not always reliable; WER computes the Levenshtein distance between the hypothesis words and the reference words. Human evaluation is considered the gold standard in MT evaluation, so we also report translation scores from human judges. This research should help in evaluating the performance of MT engines, especially on domain-specific low-resource language pairs; in judging MT engines with human evaluators; and in using human judgments to verify the accuracy of the available automatic evaluation metrics.
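
To make the two automatic metrics concrete, the following is a minimal, self-contained Python sketch, not the authors' implementation: word-level WER as a normalized Levenshtein distance, and the clipped n-gram precision that underlies BLEU's matching step (full BLEU additionally applies a brevity penalty and averages over n-gram orders). The example sentence pair is hypothetical.

from collections import Counter

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: Levenshtein distance between the hypothesis and
    reference token sequences, normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(1, len(ref))

def ngram_precision(reference: str, hypothesis: str, n: int = 1) -> float:
    """Clipped n-gram precision, the core of BLEU's n-gram matching:
    each hypothesis n-gram is credited at most as often as it occurs
    in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    return overlap / max(1, sum(hyp_ngrams.values()))

# Hypothetical example pair (not drawn from the paper's test set):
ref = "the cat sat on the mat"
hyp = "the cat sat mat"
print(wer(ref, hyp))                  # 2 edits / 6 reference words = 0.33
print(ngram_precision(ref, hyp, 1))   # 4 of 4 unigrams matched = 1.0
print(ngram_precision(ref, hyp, 2))   # 2 of 3 bigrams matched = 0.67

Note how the two metrics can disagree on the same pair: the unigram precision is perfect while WER penalizes the two deleted words, which is one reason the paper compares them against human judgments.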
