An in-depth analysis of the individual impact of controlled language rules on machine translation output: a mixed-methods approach
Examining the general impact of Controlled Language (CL) rules in the context of Machine Translation (MT) has been an area of research for many years. The present study focuses on the following question: how do CL rules impact MT output individually? By analysing a German corpus-based test suite of technical texts that have been translated into English by different MT systems, this study endeavours to answer this question at different levels: the general impact of CL rules (rule- and system-independent), their impact at rule level (system-independent) as well as at rule and system level. The results of five MT systems are analysed and contrasted: a rule-based system, a statistical system, two differently constructed hybrid systems, and a neural system. For this, a mixed-methods triangulation approach that includes error annotation, human evaluation, and automatic evaluation was applied. The data was analysed both qualitatively and quantitatively in terms of CL influence on the following parameters: number and type of MT errors, style and content quality, and scores of two automatic evaluation metrics. In line with many studies, the results show a general positive impact of the applied CL rules on the MT output. However, at rule level, only four rules proved to have positive effects on the aforementioned parameters; three rules had negative effects on the parameters; and two rules did not show any significant impact. At rule and system level, the rules affected the MT systems differently, as expected. Rules that had a positive impact on earlier MT approaches did not show the same impact on the neural MT approach. Furthermore, neural MT delivered distinctly better results than earlier MT approaches, namely the highest error-free, style and content quality rates both before and after applying the rules, which indicates that neural MT offers a promising solution that no longer requires CL rules for improving the MT output.
- Research Article
27
- 10.1007/s10590-019-09233-w
- May 31, 2019
- Machine Translation
Many studies have shown that the application of controlled languages (CL) is an effective pre-editing technique to improve machine translation (MT) output. In this paper, we investigate whether this also holds true for neural machine translation (NMT). We compare the impact of applying nine CL rules on the quality of NMT output as opposed to that of rule-based, statistical, and hybrid MT by applying three methods: error annotation, human evaluation, and automatic evaluation. The analyzed data is a German corpus-based test suite of technical texts that have been translated into English by five MT systems (a neural, a rule-based, a statistical, and two hybrid MT systems). The comparison is conducted in terms of several quantitative parameters (number of errors, error types, quality ratings, and automatic evaluation metrics scores). The results show that CL rules positively affect rule-based, statistical, and hybrid MT systems. However, CL does not improve the results of the NMT system. The output of the neural system is mostly error-free both before and after CL application and has the highest quality among the analyzed MT systems in both scenarios, although its quality decreases after the CL rules are applied. The qualitative discussion of the NMT output sheds light on the problems that CL causes for this kind of MT architecture.
- Dissertation
2
- 10.23889/suthesis.57439
- Jul 22, 2021
In general, advances in translation technology tools have enhanced translation quality significantly. Unfortunately, however, it seems that this is not the case for all language pairs. A concern arises when the users of translation tools want to work between different language families such as Arabic and English. The main problems facing Arabic<>English translation tools lie in Arabic’s characteristic free word order, richness of word inflection – including orthographic ambiguity – and optionality of diacritics, in addition to a lack of data resources. The aim of this study is to compare the performance of translation memory (TM) and machine translation (MT) systems in translating between Arabic and English. The research evaluates the two systems based on specific criteria relating to needs and expected results. The first part of the thesis evaluates the performance of a set of well-known TM systems when retrieving a segment of text that includes an Arabic linguistic feature. As it is widely known that TM matching metrics are based solely on the use of edit distance string measurements, it was expected that the aforementioned issues would lead to a low match percentage. The second part of the thesis evaluates multiple MT systems that use the mainstream neural machine translation (NMT) approach to translation quality. Due to a lack of training data resources and its rich morphology, it was anticipated that Arabic features would reduce the translation quality of this corpus-based approach. The systems’ output was evaluated using both automatic evaluation metrics, including BLEU and hLEPOR, and TAUS human quality ranking criteria for adequacy and fluency. The study employed a black-box testing methodology to experimentally examine the TM systems through a test suite instrument and also to translate Arabic<>English sentences to collect the MT systems’ output.
A translation threshold was used to evaluate the fuzzy matches of TM systems, while an online survey was used to collect participants’ responses to the quality of the MT systems’ output. The experiments’ input for both systems was extracted from Arabic<>English corpora and examined by means of quantitative data analysis. The results show that, when retrieving translations, the current TM matching metrics are unable to recognise Arabic features and score them appropriately. In terms of automatic translation, MT produced good results for adequacy, especially when translating from Arabic to English, but the systems’ output appeared to need post-editing for fluency. Moreover, when retrieving from Arabic, it was found that short sentences were handled much better by MT than by TM. The findings may be given as recommendations to software developers.
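To make the expectation about edit-distance matching concrete, the following is a minimal sketch of a character-level fuzzy match score. The scoring formula (one minus the Levenshtein distance divided by the longer segment length) and the function names are illustrative assumptions; commercial TM systems use their own, often undisclosed, formulas, and the thesis does not specify one.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(source: str, tm_segment: str) -> float:
    """Hypothetical match percentage: 100 * (1 - distance / longer length)."""
    longest = max(len(source), len(tm_segment))
    if longest == 0:
        return 100.0  # two empty segments are treated as identical
    return 100.0 * (1 - levenshtein(source, tm_segment) / longest)


# The same Arabic word without and with short-vowel diacritics (fatha).
# Escapes are used so the code points are unambiguous.
plain = "\u0643\u062A\u0628"                      # kaf, ta, ba
vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"  # each letter followed by fatha

print(fuzzy_match(plain, vocalised))
```

Under this assumed formula, every optional diacritic counts as a full edit, so two spellings of the same word score as a poor match. This illustrates the kind of effect the study anticipated, not the behaviour of any particular TM product.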
- Conference Article
1
- 10.5339/qfarc.2018.ictpd885
- Jan 1, 2018
Toward a Cognitive Evaluation Approach for Machine Translation PostEditing
- Research Article
21
- 10.1109/taslp.2022.3161160
- Jan 1, 2022
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
Machine translation (MT) outputs are widely scored using automatic evaluation metrics and human evaluation scores. The automatic evaluation metrics are expected to be easily computable and a reflection of human evaluation. Traditional string-based metrics such as BLEU and ChrF++ are widely used to evaluate MT systems, but fail to account for synonyms that appear in the state-of-the-art neural machine translation (NMT) systems, owing to their inability to evaluate paraphrases. While similarity-based metrics such as YiSi and BERTScore address this issue, these metrics need to be modified to better evaluate morphologically rich Indian languages such as Tamil and Hindi. The current work proposes a novel sentence-BERT based similarity (SBSim) metric that makes use of a paraphrase-BERT model and sentence-level embedding to evaluate NMT outputs. The effectiveness of BLEU, ChrF++, YiSi, BERTScore, and the proposed SBSim is evaluated on English-to-Tamil and English-to-Hindi NMT outputs. The sentence-level metric correlation of the proposed SBSim metric with respect to human scores is observed to outperform the existing metrics with a correlation of 0.9123 and 0.9052 for English-to-Tamil and English-to-Hindi NMT systems, respectively. Further, the average metric correlation of the SBSim metric is also observed to be the highest with a value of 0.9801 and 0.9836 for these NMT systems, respectively. The proposed metric is also evaluated on the WMT2020 dataset and reports the highest correlation of 0.7129 with the human scores.
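The sentence-level correlation reported above can be illustrated with a short sketch. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the score lists are made-up toy values rather than data from the paper.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Toy values only: one metric score and one human score per translated sentence.
metric_scores = [0.62, 0.81, 0.47, 0.90, 0.55]
human_scores = [3.0, 4.0, 2.0, 5.0, 3.0]

print(round(pearson(metric_scores, human_scores), 4))
```

A value close to 1 means the metric ranks and spaces sentences much as the human raters do, which is the sense in which one metric is said to outperform another.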
- Research Article
81
- 10.3390/math11041006
- Feb 16, 2023
- Mathematics
The success of Transformer architecture has seen increased interest in machine translation (MT). The translation quality of neural network-based MT transcends that of translations derived using statistical methods. This growth in MT research has entailed the development of accurate automatic evaluation metrics that allow us to track the performance of MT. However, automatically evaluating and comparing MT systems is a challenging task. Several studies have shown that traditional metrics (e.g., BLEU, TER) show poor performance in capturing semantic similarity between MT outputs and human reference translations. To date, to improve performance, various evaluation metrics have been proposed using the Transformer architecture. However, a systematic and comprehensive literature review on these metrics is still missing. Therefore, it is necessary to survey the existing automatic evaluation metrics of MT to enable both established and new researchers to quickly understand the trend of MT evaluation over the past few years. In this survey, we present the trend of automatic evaluation metrics. To better understand the developments in the field, we provide the taxonomy of the automatic evaluation metrics. Then, we explain the key contributions and shortcomings of the metrics. In addition, we select the representative metrics from the taxonomy, and conduct experiments to analyze related problems. Finally, we discuss the limitation of the current automatic metric studies through the experimentation and our suggestions for further research to improve the automatic evaluation metrics.
- Research Article
1
- 10.1016/j.mex.2025.103613
- Sep 8, 2025
- MethodsX
Machine Translation (MT) evaluation plays a crucial role in advancing systems translating into morphologically rich, low-resource languages such as Slovak. Existing automatic evaluation methods typically offer a single quality score, lacking insight into specific error types. A novel linguistically informed methodology that predicts the probability of MT error categories by integrating manual annotation with automatic evaluation metrics is proposed. The method builds on a modified MQM framework adapted for Slovak and employs a dataset of English-to-Slovak translations, combining outputs from statistical and neural MT systems with human reference translations. Manual annotations identified five linguistically motivated error categories. Reliability of 68 automatic metrics was assessed using Cronbach’s alpha, correlation coefficients, coefficient of determination (R²), and entropy. Bootstrapped logistic regression models were then developed to predict error occurrence probabilities. The proposed methodology improves the explainability and reliability of automatic MT evaluation by bridging the gap between holistic scoring and detailed error categorization. It significantly reduces the human effort required for quality assessment while maintaining a high degree of linguistic relevance, particularly for complex target languages like Slovak.
• Predicts probabilities of specific MT error categories
• Integrates linguistic expertise with statistical reliability analysis
• Reduces human effort in MT evaluation while preserving linguistic precision
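As a rough illustration of one of the reliability measures named above, the sketch below computes Cronbach's alpha with the standard formula, α = k/(k−1) · (1 − Σ item variances / variance of totals). Treating each automatic metric as an "item" and each translated segment as an observation is an assumption about how the study set this up, and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha for k items, each a list of n observations.

    Assumption for this sketch: every inner list holds the scores that one
    automatic metric assigned to the same n segments, on comparable scales.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if n < 2 or any(len(col) != n for col in items):
        raise ValueError("every item needs the same number (>= 2) of observations")
    item_variance_sum = sum(variance(col) for col in items)
    # Per-segment total across all metrics.
    totals = [sum(col[i] for col in items) for i in range(n)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when the totals do not vary")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Two metrics that agree perfectly give the maximum value of 1.0.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))
```

Metrics on very different scales would normally be standardised first; that step is left out to keep the sketch short.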
- Research Article
48
- 10.1007/s10590-018-9214-x
- Feb 10, 2018
- Machine Translation
This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established Multidimensional Quality Metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which made the annotation process feasible and accurate. Errors in MT outputs were then annotated by two annotators following this taxonomy. Subsequently, we carried out a statistical analysis which showed that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conducted an additional analysis of agreement errors in which we distinguished between short (phrase-level) and long distance (sentence-level) errors. We discovered that phrase-based MT approaches are of limited use for long distance agreement phenomena, for which neural MT was found to be especially effective.
- Research Article
1
- 10.14198/elua.21900
- Jul 19, 2022
- ELUA
With the active participation of users in product review platforms, online consumer-generated content, and, more specifically, user-generated reviews, have become a clear reference in purchasing decision-making processes, which sometimes exceed the impact of advertising campaigns. A common feature of most tourism review platforms is the use of machine translation (MT) systems to immediately make reviews available to users in various languages. However, the quality of the MT output of these reviews varies greatly, primarily due to the subjective and unstructured nature of this digital genre. Different studies confirm that there are no universal quality rating scales. The assessment of MT output quality usually depends on factors such as the purpose of the text or the value given to the immediacy of the translation. New neural MT systems have been a revolution in the quality increase of the translated output; however, new lines of research are opening up to verify whether the quality of this new paradigm of MT can be assessed with the existing scales, mainly from previous rule-based systems and statistical translation, or whether it is necessary to develop new quality metrics specifically for these new intelligent systems. On the other hand, one of the questions that remain to be resolved in this new context of neural MT is whether the use of large amounts of textual data in the training of these systems is as effective as the use of less data but of higher quality and better-adjusted to the specialty and type of text for which it is used. Based on the hypothesis that each genre requires specific quality rating scales, this work identifies the error patterns and textual characteristics of online user reviews from a corpus-based approach analysis that will contribute to adapting quality rating scales to this specific digital genre.
- Research Article
28
- 10.1007/s10590-020-09251-z
- Aug 19, 2020
- Machine Translation
Terminology translation plays a critical role in domain-specific machine translation (MT). Phrase-based statistical MT (PB-SMT) has been the dominant approach to MT for the past 30 years, both in academia and industry. Neural MT (NMT), an end-to-end learning approach to MT, is steadily taking the place of PB-SMT. In this paper, we conduct comparative qualitative evaluation and comprehensive error analysis on terminology translation in PB-SMT and NMT in two translation directions: English-to-Hindi and Hindi-to-English. To the best of our knowledge, there is no gold standard available for evaluating terminology translation quality in MT. For this reason, we select an evaluation test set from a legal domain corpus and create a gold standard for evaluating terminology translation in MT. We also propose an error typology taking the terminology translation errors in MT into consideration. We translate sentences of the test set with our MT systems, and terminology translations are manually classified as per the error typology. We evaluate the MT systems’ performance on terminology translation, and demonstrate our findings, unraveling strengths, weaknesses, and similarities of PB-SMT and NMT in the area of term translation.
- Research Article
108
- 10.1016/j.jbi.2018.07.018
- Jul 19, 2018
- Journal of Biomedical Informatics
Development of machine translation technology for assisting health communication: A systematic review.
- Supplementary Content
3
- 10.6092/unibo/amsdottorato/9191
- Mar 30, 2020
- AMS Dottorato Institutional Doctoral Theses Repository (University of Bologna)
The present work is a feasibility study on the application of Machine Translation (MT) to institutional academic texts, specifically course catalogues, for Italian-English and German-English. The first research question of this work focuses on the feasibility of profitably applying MT to such texts. Since the benefits of a good quality MT might be counteracted by preconceptions of translators towards the output, the second research question examines translator trainees' trust towards an MT output as compared to a human translation (HT). Training and test sets are created for both language combinations in the institutional academic domain. The MT systems used are ModernMT and Google Translate. Overall evaluations of the output quality are carried out using automatic metrics. Results show that applying neural MT to institutional academic texts can be beneficial even when bilingual data are not available. When small amounts of sentence pairs become available, MT quality improves. Then, a gold standard data set with manual annotations of terminology (MAGMATic) is created and used for an evaluation of the output focused on terminology translation. The gold standard was publicly released to stimulate research on terminology assessment. The assessment proves that domain adaptation improves the quality of term translation. To conclude, a method to measure trust in a post-editing task is proposed and results regarding translator trainees' trust towards MT are outlined. All participants are asked to work on the same text. Half of them are told that it is an MT output to be post-edited, and the other half that it is an HT needing revision. Results prove that there is no statistically significant difference between post-editing and HT revision in terms of number of edits and temporal effort. Results thus suggest that a new generation of translators that received training on MT and post-editing is not influenced by preconceptions against MT.
- Research Article
- 10.15640/jflcc.v8n2a2
- Jan 1, 2020
- Journal of Foreign Languages, Cultures and Civilizations
Editing Taiwan Divination Verses with Controlled Language Strategies: Machine-Translation-Mediated Effective Communication
Chung-ling Shih
Aimed at fostering machine-translation-mediated communication across languages and cultures, this paper proposes editing Taiwan divination verses from natural into controlled language to improve the comprehensibility of machine translation (MT) outputs. After editing 160 divination verses and evaluating the semantic and grammatical accuracy of their English MTs produced by the free online neural MT system Google Translate, the author has identified several controlled language strategies. The lexical strategies include replacement of archaic Chinese words with vernacular Chinese ones, paraphrasing of cultural references, and insertion of explanations for metaphors. Grammatical strategies are the use of articles, determiners, and possessive cases, and syntactical ones include the addition of conjunctions and the restoration of missing subjects and/or objects. The MT outputs of edited and unedited divination verses are compared. The findings show that the English MT outputs of edited texts are considerably more accurate semantically, grammatically, and syntactically, so the effectiveness of controlled language strategies is justified. Due to the effectiveness of editing verses with controlled language strategies, the goal of MT-enabled web-based communication across cultures is achieved. The practical significance, including culture acquisition and cost reduction, is also discussed.
- Research Article
9
- 10.1016/j.csl.2018.10.005
- Nov 8, 2018
- Computer Speech & Language
Estimating post-editing time using a gold-standard set of machine translation errors
- Research Article
2
- 10.3390/ijerph18189873
- Sep 19, 2021
- International Journal of Environmental Research and Public Health
Background: Machine translation (MT) technologies have increasing applications in healthcare. Despite their convenience, cost-effectiveness, and constantly improved accuracy, research shows that the use of MT tools in medical or healthcare settings poses risks to vulnerable populations. Objectives: We aimed to develop machine learning classifiers (MNB and RVM) to forecast nuanced yet significant MT errors of clinical symptoms in Chinese neural MT outputs. Methods: We screened human translations of MSD Manuals for information on self-diagnosis of infectious diseases and produced their matching neural MT outputs for subsequent pairwise quality assessment by trained bilingual health researchers. Different feature optimisation and normalisation techniques were used to identify the best feature set. Results: The RVM classifier using optimised, normalised (L2 normalisation) semantic features achieved the highest sensitivity, specificity, AUC, and accuracy. MNB achieved similar high performance using the same optimised semantic feature set. The best probability threshold of the best performing RVM classifier was found at 0.6, with a very high positive likelihood ratio (LR+) of 27.82 (95% CI: 3.99, 193.76) and a low negative likelihood ratio (LR−) of 0.19 (95% CI: 0.08, 0.46). These values suggest the high diagnostic utility of our model for predicting the probability of erroneous MT of disease symptoms, which can help reverse potentially inaccurate self-diagnosis of diseases among vulnerable people without adequate medical knowledge or an ability to ascertain the reliability of MT outputs. Conclusion: Our study demonstrated the viability, flexibility, and efficiency of introducing machine learning models to help promote risk-aware use of MT technologies to achieve optimal, safer digital health outcomes for vulnerable people.
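For readers unfamiliar with likelihood ratios, the sketch below shows the standard definitions, LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity, computed from a confusion matrix. The function name and the counts are illustrative assumptions; they are not the study's data and do not reproduce its reported values.

```python
def likelihood_ratios(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (LR+, LR-) from confusion-matrix counts.

    tp/fn: erroneous translations flagged / missed by the classifier.
    tn/fp: acceptable translations passed / wrongly flagged.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # With no false positives LR+ is unbounded; with no true negatives LR- is.
    lr_pos = float("inf") if fp == 0 else sensitivity / (1 - specificity)
    lr_neg = float("inf") if tn == 0 else (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Toy counts: sensitivity 0.9, specificity 0.8.
lr_pos, lr_neg = likelihood_ratios(tp=90, fn=10, tn=80, fp=20)
print(round(lr_pos, 3), round(lr_neg, 3))
```

A large LR+ means a positive prediction strongly raises the odds that the translation is erroneous, and an LR− well below 1 means a negative prediction meaningfully lowers them.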
- Conference Article
4
- 10.24963/ijcai.2018/789
- Jul 1, 2018
In recent years, deep learning algorithms have revolutionized several areas, including speech, image, and natural language processing. The specific field of Machine Translation (MT) has not remained unaffected. Integration of deep learning in MT varies from re-modeling existing features into standard statistical systems to the development of a new architecture. Among the different neural networks, research works use feed-forward neural networks, recurrent neural networks, and the encoder-decoder schema. These architectures are able to tackle challenges such as low-resource settings or morphological variation. This extended abstract focuses on describing the foundational works on the neural MT approach, mentioning its strengths and weaknesses, and including an analysis of the corresponding challenges and future work. The full manuscript [Costa-jussà, 2018] describes, in addition, how these neural networks have been integrated to enhance different aspects and models from statistical MT, including language modeling, word alignment, translation, reordering, and rescoring; it also describes the new neural MT approach together with recent approaches using subwords, characters, and multilingual training, among others.