This paper surveys current automated metrics for neural machine translation evaluation and discusses two approaches to assessing translation results: evaluation with selected automated metrics and evaluation by an expert translator. Neural machine translation systems were tested for the purposes of analysis and evaluation; Google Translate and DeepL Translate, both of which apply a neural network approach, were chosen as the systems under study. The following metrics were considered: METEOR as a traditional reference-based metric, COMET as a neural reference-based metric, and COMET-kiwi as a neural reference-free metric. Although these metrics are automated, human input remains central to the automation: even neural metric models are trained on data provided by humans, since at present it is impossible to do without reference translations or quality estimations produced by experts. Such metrics make it possible to understand and investigate machine translation, its capabilities and limitations, and to determine the direction of its development. As part of the analysis, a piece of source text was selected, translated into the target languages with the chosen neural machine translation systems to obtain candidate translations, and a reference translation was specified for each of them. The metric scores were helpful for understanding how close machine translation is to human translation and for assessing the current stage of development of machine translation systems, while the expert evaluation showed how well such systems perform in terms of translation quality.
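
The scoring workflow described above can be illustrated with a minimal Python sketch, assuming the Hugging Face evaluate package for METEOR and the Unbabel comet package for COMET and COMET-kiwi; the checkpoint names, example sentences, and installation details below are assumptions for illustration, not details taken from the paper.

    # pip install evaluate nltk unbabel-comet
    # (the COMET checkpoints are downloaded from the Hugging Face Hub and may
    #  require an account and an accepted model license)

    import evaluate                                           # wraps METEOR
    from comet import download_model, load_from_checkpoint    # Unbabel COMET

    # Hypothetical source text, machine-translated candidate, and human reference
    source = "Der Bericht wurde gestern veröffentlicht."
    candidate = "The report was published yesterday."
    reference = "The report came out yesterday."

    # 1) METEOR: traditional reference-based metric (surface overlap, stemming, synonymy)
    meteor = evaluate.load("meteor")
    m = meteor.compute(predictions=[candidate], references=[reference])
    print("METEOR:", m["meteor"])

    # 2) COMET: neural reference-based metric (scores source, candidate, and reference together)
    comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    comet_out = comet_model.predict(
        [{"src": source, "mt": candidate, "ref": reference}], batch_size=1, gpus=0
    )
    print("COMET:", comet_out.system_score)

    # 3) COMET-kiwi: neural reference-free metric (needs only source and candidate)
    kiwi_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
    kiwi_out = kiwi_model.predict(
        [{"src": source, "mt": candidate}], batch_size=1, gpus=0
    )
    print("COMET-kiwi:", kiwi_out.system_score)

In this setup the reference-based metrics (METEOR, COMET) compare each candidate against the expert reference translation, while COMET-kiwi estimates quality from the source and candidate alone, which matches the role each metric plays in the evaluation described in the abstract.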