Burrows’ Delta as a Convergent Validator: Stylometric Analysis for Complementary Machine Translation Evaluation

Abstract

Machine translation (MT) systems are typically evaluated by comparing outputs to human references using metrics that approximate adequacy and fluency, but these metrics are not designed to measure stylistic fidelity, i.e., how closely an output matches the target-language stylistic profile of a high-quality human literary translation. We test whether stylometric distance, operationalized with Burrows’ Delta over the 500 most frequent words, can serve as a convergent validator of adequacy signals while providing interpretable, reference-free diagnostics. Using nine contemporary Greek short stories with author-produced English self-translations and MT outputs, segmented into non-overlapping five-sentence windows, we compare an inverted, min–max normalized Burrows’ Delta score (invΔ_B) against standard reference-based MT metrics (BLEU, chrF2, TER, BLEURT, COMET, BERTScore) and against an adequacy composite (TQI_win). We find strong convergence between stylometric proximity and adequacy signals, particularly at decision-relevant extremes, but stylometry underperforms adequacy metrics when used alone and provides no incremental predictive benefit beyond semantic-embedding baselines. We conclude that stylometry is best used as a complementary, explainable diagnostic and as a constrained reference-free monitor, not as a substitute for adequacy-oriented MT evaluation.
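For readers who want to see the mechanics, the sketch below shows how Burrows’ Delta over the most frequent words and the inverted, min–max normalized score invΔ_B could be computed. The tokenization, function names, and handling of the 500-word cut-off are illustrative assumptions; the paper’s own implementation may differ.

```python
# Minimal sketch (not the authors' released code) of Burrows' Delta over the
# n most frequent words, plus the inverted, min-max normalized score invΔ_B.
# Tokenization and corpus handling are illustrative assumptions.
import numpy as np
from collections import Counter

def relative_freqs(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens) or 1
    return np.array([counts[w] / total for w in vocab])

def burrows_delta_matrix(texts, n_mfw=500):
    """Pairwise Burrows' Delta over the n_mfw most frequent words of the corpus."""
    tokenized = [t.lower().split() for t in texts]                 # naive tokenization
    corpus_counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = [w for w, _ in corpus_counts.most_common(n_mfw)]
    freqs = np.vstack([relative_freqs(toks, vocab) for toks in tokenized])
    mu, sigma = freqs.mean(axis=0), freqs.std(axis=0)
    sigma[sigma == 0] = 1e-12                                      # avoid division by zero
    z = (freqs - mu) / sigma                                       # z-score each word feature
    # Delta(i, j) = mean absolute difference between the z-score profiles of texts i and j
    return np.abs(z[:, None, :] - z[None, :, :]).mean(axis=2)

def inverted_normalized_delta(deltas):
    """Map raw Delta distances to invΔ_B in [0, 1]; higher means stylistically closer."""
    d = np.asarray(deltas, dtype=float)
    lo, hi = d.min(), d.max()
    if hi == lo:
        return np.ones_like(d)
    return 1.0 - (d - lo) / (hi - lo)
```

In the setting described above, each five-sentence MT window would be compared against the corresponding self-translation windows, with the resulting distances inverted and normalized across the evaluation set.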

Similar Papers
  • Conference Article
  • Cited by 1
  • 10.5339/qfarc.2018.ictpd885
Toward a Cognitive Evaluation Approach for Machine Translation Post-Editing
  • Jan 1, 2018
  • Wajdi Zaghouani + 1 more

  • Research Article
  • Cited by 15
  • 10.1007/s10579-018-9430-2
VERTa: a linguistic approach to automatic machine translation evaluation
  • Oct 15, 2018
  • Language Resources and Evaluation
  • Elisabet Comelles + 1 more

Machine translation (MT) is directly linked to its evaluation in order both to compare different MT system outputs and to analyse system errors so that they can be addressed and corrected. As a consequence, MT evaluation has become increasingly important and popular in the last decade, leading to the development of MT evaluation metrics that aim to assess MT output automatically. Most of these metrics use reference translations to compare system output, and the most well-known and widely used ones work at the lexical level. In this study we describe and present a linguistically motivated metric, VERTa, which aims at using and combining a wide variety of linguistic features at the lexical, morphological, syntactic and semantic levels. Before designing and developing VERTa, a qualitative linguistic analysis of data was performed so as to identify the linguistic phenomena that an MT metric must consider (Comelles et al. 2017). In the present study we introduce VERTa’s design and architecture and report the experiments performed in order to develop the metric and to check the suitability and interaction of the linguistic information used. The experiments carried out go beyond traditional correlation scores and step towards a more qualitative approach based on linguistic analysis. Finally, in order to check the validity of the metric, an evaluation has been conducted comparing the metric’s performance to that of other well-known state-of-the-art MT metrics.

  • Research Article
  • Cited by 12
  • 10.3233/jifs-169504
Detecting errors in machine translation using residuals and metrics of automatic evaluation
  • May 18, 2018
  • Journal of Intelligent & Fuzzy Systems
  • Michal Munk + 1 more

Errors and residuals are closely related measures of deviation. An error is the deviation of the observed value (PEMT output) from the expected value (MT output), while the residual of the observed value is the difference between the observed and predicted value of quality. We propose an exploratory data technique that represents an ideal instrument to evaluate and improve machine translation (MT) systems. The main contribution consists of a rigorous statistical technique, novel to research on MT evaluation, given by residual analysis to identify differences between MT output and post-edited machine translation output with respect to the human translation (reference). Residual analysis of the automatic metrics can help us discover significant differences between MT and PEMT and identify questionable issues regarding the single reference. In this study, we show the usage of residuals in MT evaluation. Using residual analysis, we identified sentences in which significant differences were found in the scores of automatic metrics between MT output and post-edited (PE) MT output for translation from Slovak into English.
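One plausible reading of the residual idea is sketched below: regress sentence-level PEMT metric scores on the corresponding MT scores and flag sentences with large standardized residuals. The linear model, the threshold, and the variable names are illustrative assumptions, not the authors’ exact procedure.

```python
# Hedged sketch of residual analysis over sentence-level metric scores:
# fit a simple linear relation between MT and post-edited MT (PEMT) scores,
# then flag sentences whose standardized residuals are unusually large.
# Model form and threshold are illustrative assumptions.
import numpy as np

def flag_by_residuals(mt_scores, pemt_scores, z_threshold=2.0):
    mt = np.asarray(mt_scores, dtype=float)
    pe = np.asarray(pemt_scores, dtype=float)
    slope, intercept = np.polyfit(mt, pe, deg=1)        # simple linear fit
    residuals = pe - (slope * mt + intercept)            # observed minus predicted quality
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    return np.where(np.abs(z) > z_threshold)[0]          # indices of questionable sentences
```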

  • Research Article
  • Cited by 11
  • 10.5539/ijel.v10n2p184
Teaching Arabic Machine Translation to EFL Student Translators: A Case Study of Omani Translation Undergraduates
  • Feb 5, 2020
  • International Journal of English Linguistics
  • Yasser Muhammad Naguib Sabtan

The present paper describes a machine translation (MT) course taught to undergraduate students in the Department of English Language and Literature at Dhofar University in Oman. The course is one of the major requirements for the BA in Translation. Fifteen EFL translation students in their third year of study were enrolled in the course. The author presents both the theoretical and practical parts of the course. In the theoretical part, the topics covered in the course are outlined. The practical part focuses on the translation students’ post-editing of online MT output. This is beneficial to the students, as free online MT systems can potentially be used as a means for improving student translators’ training and EFL learning. This is achieved by subjecting MT output to analysis or post-editing by the students so that they can focus on the differences between the source and target languages. With this goal in mind, assignments were given to the students to post-edit the Arabic and English MT output of three free online MT systems (Systran, Babylon and Google Translate), discuss the linguistic problems that they spot for each system and choose the one with the fewest errors. The results show that the students, with varying degrees of success, managed to identify some linguistic errors in the MT output for each MT system and thus produced a better human translation. The paper concludes that there is a need to incorporate MT courses in translation departments in the Arab world, as integrating technology into translation curricula will have a great effect on student translators’ training for their future careers as professional translators.

  • Book Chapter
  • Cited by 20
  • 10.1007/978-1-4419-7713-7_5
Machine Translation Evaluation and Optimization
  • Jan 1, 2011
  • Bonnie Dorr + 3 more

The evaluation of machine translation (MT) systems is a vital field of research, both for determining the effectiveness of existing MT systems and for optimizing the performance of MT systems. This part describes a range of different evaluation approaches used in the GALE community and introduces evaluation protocols and methodologies used in the program. We discuss the development and use of automatic, human, task-based and semi-automatic (human-in-the-loop) methods of evaluating machine translation, focusing on the use of human-mediated translation error rate (HTER) as the evaluation standard used in GALE. We discuss the workflow associated with the use of this measure, including post-editing, quality control, and scoring. We document the evaluation tasks, data, protocols, and results of recent GALE MT evaluations. In addition, we present a range of different approaches for optimizing MT systems on the basis of different measures. We outline the requirements and specific problems when using different optimization approaches and describe how the characteristics of different MT metrics affect the optimization. Finally, we describe novel recent and ongoing work on the development of fully automatic MT evaluation metrics that have the potential to substantially improve the effectiveness of evaluation and optimization of MT systems.

  • Book Chapter
  • 10.1007/978-3-642-25661-5_62
A Naïve Automatic MT Evaluation Method without Reference Translations
  • Jan 1, 2011
  • Junjie Jiang + 2 more

Traditional automatic machine translation (MT) evaluation methods adopt the idea of calculating the similarity between machine translation output and human reference translations. However, given the needs of many users, proposing an evaluation method that works without references is a key research issue. In this paper, we propose a novel automatic MT evaluation method that requires no human reference translations. First, we calculate the average n-gram probability of the source sentence with source language models; similarly, we calculate the average n-gram probability of the machine-translated sentence with target language models; finally, we use the relative error of the two average n-gram probabilities to score the machine-translated sentence. The experimental results show that our method can achieve high correlations with several automatic MT evaluation metrics. The main contribution of this paper is that users can obtain a reliable MT evaluation in the absence of reference translations, which greatly improves the utility of MT evaluation metrics. Keywords: machine translation evaluation, automatic evaluation, without reference translations.
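The scoring idea is simple enough to sketch. In the snippet below, the language-model interface (a callable returning an n-gram probability) and the smoothing constant are assumptions rather than the paper’s implementation.

```python
# Sketch of reference-free scoring via average n-gram probabilities:
# compare the source sentence under a source-language LM with the MT output
# under a target-language LM, and take the relative error of the two averages.
# The LM interface and smoothing constant are illustrative assumptions.
from typing import Callable, Sequence, Tuple

def avg_ngram_prob(tokens: Sequence[str],
                   lm: Callable[[Tuple[str, ...]], float],
                   n: int = 3, eps: float = 1e-12) -> float:
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return eps
    return sum(max(lm(g), eps) for g in ngrams) / len(ngrams)

def reference_free_score(src_tokens, mt_tokens, src_lm, tgt_lm, n=3):
    p_src = avg_ngram_prob(src_tokens, src_lm, n)
    p_mt = avg_ngram_prob(mt_tokens, tgt_lm, n)
    return abs(p_src - p_mt) / p_src      # smaller relative error = better-matched fluency
```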

  • Video Transcripts
  • 10.48448/jp5f-0z79
Difficulty-Aware Machine Translation Evaluation
  • Aug 1, 2021
  • Underline Science Inc.
  • Lidia S Chao + 3 more

The high-quality translation results produced by machine translation (MT) systems still pose a huge challenge for automatic evaluation. Current MT evaluation pays the same attention to each sentence component, while the questions of real-world examinations (e.g., university examinations) have different difficulties and weightings. In this paper, we propose a novel difficulty-aware MT evaluation metric, expanding the evaluation dimension by taking translation difficulty into consideration. A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function, and conversely. Experimental results on the WMT19 English-German Metrics shared tasks show that our proposed method outperforms commonly used MT metrics in terms of human correlation. In particular, our proposed method performs well even when all the MT systems are very competitive, which is when most existing metrics fail to distinguish between them. The source code is freely available at https://github.com/NLP2CT/Difficulty-Aware-MT-Evaluation.
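A hedged sketch of the weighting idea follows; the specific weighting function (one minus the mean system score, renormalized) is an illustrative assumption, not the paper’s published formula.

```python
# Sketch of difficulty-aware aggregation: sentences that most systems translate
# poorly are treated as difficult and weighted more heavily in each system's
# final score. The weighting function is an illustrative assumption.
import numpy as np

def difficulty_weighted_scores(scores):
    """scores: array of shape (n_systems, n_sentences), sentence-level scores in [0, 1].
    Returns one difficulty-weighted score per system."""
    scores = np.asarray(scores, dtype=float)
    difficulty = 1.0 - scores.mean(axis=0)                # hard sentence = low average score
    if difficulty.sum() == 0:                             # all systems perfect everywhere
        weights = np.full(scores.shape[1], 1.0 / scores.shape[1])
    else:
        weights = difficulty / difficulty.sum()
    return scores @ weights                               # weighted average per system
```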

  • Research Article
  • Cited by 6
  • 10.1109/tsa.2005.860770
Using Multiple Edit Distances to Automatically Grade Outputs From Machine Translation Systems
  • Mar 1, 2006
  • IEEE Transactions on Audio, Speech and Language Processing
  • Y Akiba + 5 more

This paper addresses the challenging problem of automatically evaluating output from machine translation (MT) systems that are subsystems of speech-to-speech MT (SSMT) systems. Conventional automatic MT evaluation methods include BLEU, which MT researchers have frequently used. However, BLEU has two drawbacks in SSMT evaluation. First, BLEU assesses errors lightly at the beginning of translations and heavily in the middle, even though its assessments should be independent of position. Second, BLEU lacks tolerance in accepting colloquial sentences with small errors, although such errors do not prevent us from continuing an SSMT-mediated conversation. In this paper, the authors report a new evaluation method called “gRader based on Edit Distances (RED)” that automatically grades each MT output by using a decision tree (DT). The DT is learned from training data that are encoded by using multiple edit distances, that is, normal edit distance (ED) defined by insertion, deletion, and replacement, as well as its extensions. The use of multiple edit distances allows more tolerance than either ED or BLEU. Each evaluated MT output is assigned a grade by using the DT. RED and BLEU were compared for the task of evaluating MT systems of varying quality on ATR's Basic Travel Expression Corpus (BTEC). Experimental results show that RED significantly outperforms BLEU.
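The sketch below illustrates the general recipe of combining several edit-distance variants as features for a grade-assigning decision tree; the cost settings, features, and use of scikit-learn are assumptions and do not reproduce RED’s own encoding.

```python
# Sketch of grading MT output from multiple edit distances: compute word-level
# edit distance under different operation costs, use the variants as features,
# and fit a decision tree on human-graded examples. Costs, features, and the
# scikit-learn classifier are illustrative assumptions, not RED's own encoding.
from sklearn.tree import DecisionTreeClassifier

def edit_distance(hyp, ref, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Word-level edit distance with configurable operation costs."""
    m, n = len(hyp), len(ref)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if hyp[i - 1] == ref[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

def edit_distance_features(hyp_tokens, ref_tokens):
    """A few edit-distance variants as one feature vector per MT output."""
    plain = edit_distance(hyp_tokens, ref_tokens)
    cheap_subs = edit_distance(hyp_tokens, ref_tokens, sub_cost=0.5)
    normalized = plain / max(len(ref_tokens), 1)
    return [plain, cheap_subs, normalized]

def fit_grader(feature_vectors, human_grades, max_depth=4):
    """Learn a grade-assigning decision tree from labelled training pairs."""
    return DecisionTreeClassifier(max_depth=max_depth).fit(feature_vectors, human_grades)
```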

  • Research Article
  • 10.15640/jflcc.v8n2a2
Editing Taiwan divination Verses with controlled Language Strategies: Machine-Translation-Mediated Effective Communication
  • Jan 1, 2020
  • Journal of Foreign Languages, Cultures and Civilizations
  • Chung-Ling Shih

Aimed at fostering machine-translation-mediated communication across languages and cultures, this paper proposes editing Taiwan divination verses from natural into controlled language to improve the comprehensibility of machine translation (MT) outputs. After editing 160 divination verses and evaluating the semantic and grammatical accuracy of their English MTs produced by the free online Google Translate (a neural MT system), the author has identified several controlled language strategies. The lexical strategies include replacement of archaic Chinese words with vernacular Chinese ones, paraphrasing of culture references and insertion of explanations for metaphors. Grammatical strategies are the use of articles, determiners and possessive cases, and syntactical ones include the addition of conjunctions and the restoration of missing subjects and/or objects. The MT outputs of edited and unedited divination verses are compared. The findings show that the English MT outputs of edited texts have greatly improved in semantic, grammatical and syntactic accuracy, so the effectiveness of controlled language strategies is justified. Owing to the effectiveness of editing verses with controlled language strategies, the goal of MT-enabled web-based communication across cultures is achieved. The practical significance is also discussed, including culture acquisition and cost reduction.

  • Research Article
  • 10.4312/slo2.0.2013.1.111-133
O avtomatski evalvaciji strojnega prevajanja (On the Automatic Evaluation of Machine Translation)
  • Dec 1, 2013
  • Slovenščina 2.0: empirical, applied and interdisciplinary research
  • Darinka Verdonik + 1 more

Translation evaluation is a constant part of machine translation development, and it mostly relies on automatic procedures. These are always based on a reference translation. In this paper we show how widely reference translations for the subtitling domain can differ and how this can affect the score: the same metric can rate the same MT system as useless or as highly successful depending solely on which reference translations are used, even though these references are obtained through different procedures yet are always linguistically and semantically fully adequate.

  • Research Article
  • Cited by 2
  • 10.5167/uzh-19086
Comparative evaluation of the linguistic output of MT systems for translation and information purposes
  • Sep 17, 2001
  • Zurich Open Repository and Archive (University of Zurich)
  • Elia Yuste + 1 more

This paper describes a Machine Translation (MT) evaluation experiment where emphasis is placed on the quality of output and the extent to which it is geared to different users' needs. Adopting a very specific scenario, that of a multilingual international organisation, a clear distinction is made between two user classes: translators and administrators. Whereas the first group requires MT output to be accurate and of good post-editable quality in order to produce a polished translation, the second group primarily needs informative data for carrying out other, non-linguistic tasks, and therefore uses MT more as an information-gathering and gisting tool. During the experiment, MT output of three different systems is compared in order to establish which MT system best serves the organisation's multilingual communication and information needs. This is a comparative usability- and adequacy-oriented evaluation in that it attempts to help such organisations decide which system produces the most adequate output for certain well-defined user types. To perform the experiment, criteria relating to both users and MT output are examined with reference to the ISLE taxonomy. The experiment comprises two evaluation phases, the first at sentence level, the second at overall text level. In both phases, evaluators make use of a 1-5 rating scale. Weighted results provide some insight into the systems' usability and adequacy for the purposes described above. As a conclusion, it is suggested that further research should be devoted to the most critical aspect of this exercise, namely defining meaningful and useful criteria for evaluating the post-editability and informativeness of MT output.

  • Research Article
  • 10.3390/info16110965
Bringing Context into MT Evaluation: Translator Training Insights from the Classroom
  • Nov 7, 2025
  • Information
  • Sheila Castilho

The role of technology in translator training has become increasingly significant as machine translation (MT) evolves at a rapid pace. Beyond practical tool usage, training must now prepare students to engage critically with MT outputs and understand the socio-technical dimensions of translation. Traditional sentence-level MT evaluation, often conducted on isolated segments, can overlook discourse-level errors or produce misleadingly high scores for sentences that appear correct in isolation but are inaccurate within the broader discourse. Document-level MT evaluation has emerged as an approach that offers a more accurate perspective by accounting for context. This article presents the integration of context-aware MT evaluation into an MA-level translation module, in which students conducted a structured exercise comparing sentence-level and document-level methodologies, supported by reflective reporting. The aim was to familiarise students with context-aware evaluation techniques, expose the limitations of single-sentence evaluation, and foster a more nuanced understanding of translation quality. This study provides methodological insights for incorporating MT evaluation training into translator education and highlights how such exercises can develop critical awareness of MT’s contextual limitations. It also offers a framework for supporting students in building the analytical skills needed to evaluate MT output in professional and research settings.

  • Conference Article
  • Cited by 213
  • 10.3115/1626431.1626480
Fluency, adequacy, or HTER?
  • Jan 1, 2009
  • Matthew Snover + 3 more

Automatic Machine Translation (MT) evaluation metrics have traditionally been evaluated by the correlation of the scores they assign to MT output with human judgments of translation performance. Different types of human judgments, such as Fluency, Adequacy, and HTER, measure varying aspects of MT performance that can be captured by automatic MT metrics. We explore these differences through the use of a new tunable MT metric: TER-Plus, which extends the Translation Edit Rate evaluation metric with tunable parameters and the incorporation of morphology, synonymy and paraphrases. TER-Plus was shown to be one of the top metrics in NIST's Metrics MATR 2008 Challenge, having the highest average rank in terms of Pearson and Spearman correlation. Optimizing TER-Plus to different types of human judgments yields significantly improved correlations and meaningful changes in the weight of different types of edits, demonstrating significant differences between the types of human judgments.

  • Book Chapter
  • Cited by 7
  • 10.1007/978-3-319-49397-8_12
Identification of Relevant and Redundant Automatic Metrics for MT Evaluation
  • Jan 1, 2016
  • Michal Munk + 2 more

The paper is aimed at automatic metrics for translation quality assessment (TQA), specifically at machine translation (MT) output and the metrics for the evaluation of MT output (Precision, Recall, F-measure, BLEU, PER, WER and CDER). We examine their reliability and we determine the metrics which show decreasing reliability of the automatic evaluation of MT output. Besides the traditional measures (Cronbach’s alpha and standardized alpha) we use entropy for assessing the reliability of the automatic metrics of MT output. The results were obtained on a dataset covering translation from a low resource language (SK) into English (EN). The main contribution consists of the identification of the redundant automatic MT evaluation metrics.
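A compact sketch of the reliability calculation mentioned above: treating each metric as an item and each evaluated sentence as an observation is a standard reading of Cronbach's alpha, but the scaling and names below are assumptions rather than the chapter's exact setup.

```python
# Sketch of Cronbach's alpha for a set of automatic MT metrics: rows are
# evaluated sentences, columns are metrics (e.g. BLEU, PER, WER, CDER) scaled
# to a comparable range. Scaling and variable names are illustrative assumptions.
import numpy as np

def cronbach_alpha(score_matrix):
    """score_matrix: shape (n_sentences, n_metrics); higher alpha = more consistent metrics."""
    x = np.asarray(score_matrix, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)           # variance of each metric's scores
    total_variance = x.sum(axis=1).var(ddof=1)       # variance of per-sentence summed scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)
```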

  • Supplementary Content
  • Cited by 3
  • 10.6092/unibo/amsdottorato/9191
Machine translation for institutional academic texts: Output quality, terminology translation and post-editor trust
  • Mar 30, 2020
  • AMS Dottorato Institutional Doctoral Theses Repository (University of Bologna)
  • Randy Scansani

The present work is a feasibility study on the application of Machine Translation (MT) to institutional academic texts, specifically course catalogues, for Italian-English and German-English. The first research question of this work focuses on the feasibility of profitably applying MT to such texts. Since the benefits of good-quality MT might be counteracted by translators' preconceptions towards the output, the second research question examines translator trainees’ trust in an MT output as compared to a human translation (HT). Training and test sets are created for both language combinations in the institutional academic domain. The MT systems used are ModernMT and Google Translate. Overall evaluations of the output quality are carried out using automatic metrics. Results show that applying neural MT to institutional academic texts can be beneficial even when bilingual data are not available. When small amounts of sentence pairs become available, MT quality improves. Then, a gold standard data set with manual annotations of terminology (MAGMATic) is created and used for an evaluation of the output focused on terminology translation. The gold standard was publicly released to stimulate research on terminology assessment. The assessment proves that domain adaptation improves the quality of term translation. To conclude, a method to measure trust in a post-editing task is proposed and results regarding translator trainees’ trust towards MT are outlined. All participants are asked to work on the same text. Half of them are told that it is an MT output to be post-edited, and the other half that it is an HT needing revision. Results prove that there is no statistically significant difference between post-editing and HT revision in terms of number of edits and temporal effort. The results thus suggest that a new generation of translators who have received training on MT and post-editing is not influenced by preconceptions against MT.
