Abstract

In this paper we report the results of the first experiments with HMEANT (a semi-automatic evaluation metric that assesses translation utility by matching semantic role fillers) on the Russian language. We developed a web-based annotation interface and used it to evaluate the practicability of this metric in the MT research and development process. We studied the reliability, language independence, labor cost and discriminatory power of HMEANT by evaluating English-Russian translations produced by several MT systems. Role labeling and alignment were performed by two groups of annotators, one with a linguistic background and one without it. The experimental results were mixed: very high inter-annotator agreement at the role labeling stage dropped to much lower values at the role alignment stage, and the good correlation of HMEANT with human ranking at the system level decreased significantly at the sentence level. Analysis of the experimental results and of the annotators’ feedback suggests that the HMEANT annotation guidelines need some adaptation for Russian.
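
The abstract reports inter-annotator agreement without specifying in this excerpt which agreement statistic was used. The sketch below shows one common way such pairwise agreement on role labels could be quantified (Cohen's kappa), purely as an illustration; all labels in it are hypothetical and are not data from the study.

```python
# Illustrative only: Cohen's kappa for two annotators over categorical role labels.
# The study may use a different agreement measure; this is just a worked example.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    return (observed - expected) / (1 - expected)

# Hypothetical role labels assigned by two annotators to the same ten fillers.
annotator_1 = ["Agent", "Patient", "Agent", "Locative", "Agent",
               "Patient", "Temporal", "Agent", "Patient", "Locative"]
annotator_2 = ["Agent", "Patient", "Agent", "Locative", "Patient",
               "Patient", "Temporal", "Agent", "Patient", "Temporal"]

print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.2f}")
```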

Highlights

  • Measuring translation quality is one of the most important tasks in MT; although its history began long ago, most of the currently used approaches and metrics have been developed during the last two decades

  • The BLEU (Papineni et al., 2002), NIST (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005) metrics require a reference translation to compare with MT output in a fully automatic mode, which resulted in a dramatic speed-up of MT research and development

  • The underlying annotation cycle of HMEANT consists of two stages: semantic role labeling (SRL) and alignment
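
As a rough illustration of how the output of these two annotation stages can be turned into a score, the sketch below computes an F-measure over role fillers that annotators aligned between MT output and a reference. It deliberately simplifies the actual HMEANT definition (which weights predicates and individual role types and distinguishes partial matches): it assumes uniform weights and binary matches, and all identifiers and data are hypothetical.

```python
# Simplified, illustrative HMEANT-style scoring: uniform weights, binary matches.
# An "alignment" here is just a set of (mt_filler_id, ref_filler_id) pairs
# produced by the human alignment stage after semantic role labeling.

def f1_score(aligned_pairs, mt_fillers, ref_fillers):
    """F1 over role fillers aligned between MT output and the reference."""
    if not mt_fillers or not ref_fillers or not aligned_pairs:
        return 0.0
    precision = len(aligned_pairs) / len(mt_fillers)   # matched share of MT fillers
    recall = len(aligned_pairs) / len(ref_fillers)     # matched share of reference fillers
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotation of one sentence:
mt_fillers = ["mt_agent", "mt_action", "mt_patient", "mt_time"]     # labeled in MT output
ref_fillers = ["ref_agent", "ref_action", "ref_patient"]            # labeled in reference
aligned = {("mt_agent", "ref_agent"), ("mt_action", "ref_action")}  # judged to match

print(f"sentence score = {f1_score(aligned, mt_fillers, ref_fillers):.2f}")
```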


Summary

Introduction

Measuring translation quality is one of the most important tasks in MT; although its history began long ago, most of the currently used approaches and metrics have been developed during the last two decades. The BLEU (Papineni et al., 2002), NIST (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005) metrics require a reference translation to compare with MT output in a fully automatic mode, which resulted in a dramatic speed-up of MT research and development. These metrics correlate with manual MT evaluation and provide reliable evaluation for many languages and for different types of MT systems. An alternative approach worth mentioning is the one proposed by Snover et al. (2006), known as HTER, which measures the quality of machine translation in terms of the post-editing effort required to correct the MT output. This method was shown to correlate well with human adequacy judgments, though it was not designed for the task of gisting. HTER is not widely used in machine translation evaluation because of its high labor intensity.
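
To make the reference-based, fully automatic setup concrete, below is a stripped-down sentence-level BLEU sketch: clipped modified n-gram precision combined with a brevity penalty. It omits the smoothing and corpus-level aggregation used in real implementations, and the example sentences are made up purely for illustration.

```python
# Minimal sentence-level BLEU sketch: modified n-gram precision plus brevity penalty.
# Real implementations aggregate counts over a whole corpus and apply smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matched = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        total = sum(cand.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the annotators aligned the role fillers in both sentences".split()
candidate = "the annotators aligned role fillers in both the sentences".split()
print(f"BLEU = {bleu(candidate, reference):.2f}")
```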

