Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

Arne Defauw, Frederic Everaert, Kim Scholte, Joachim Van Den Bogaert, Koen Van Winckel, Joris Brabers, Roko Mijic, Tom Vanallemeersch, Anna Bardadym, Sara Szoc

doi:10.3390/informatics6030035

Abstract

To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual websites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that the Levenshtein distance between the target and the translated source and the cosine distance between sentence embeddings of the source and the target were the two most important features for the task of misalignment detection. Using gold standards for alignment, we demonstrated that our model can substantially increase the quality of alignments in a corpus, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
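
As an illustration only, the two features the abstract names as most important could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the function names, the length normalization of the Levenshtein distance, and the toy embedding vectors are all assumptions made for the example. The initial translation step and the sentence-embedding model are outside its scope, so the translated source and the vectors are passed in as plain inputs.

```python
import math
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (normalization is an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pair_features(translated_source: str, target: str,
                  source_vec: Sequence[float],
                  target_vec: Sequence[float]) -> List[float]:
    """Hypothetical feature vector for one aligned sentence pair."""
    return [normalized_levenshtein(translated_source, target),
            cosine_distance(source_vec, target_vec)]
```

A feature vector of this kind, computed for each manually labeled sentence pair, could then be passed to any supervised regressor; which regressor and which additional shallow features the authors used is not stated in this excerpt.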

Highlights

A machine translation (MT) system usually increases its performance when more training data is added.
Our experiments showed that removing misalignments can be beneficial in terms of data selection; however, leaving misalignments in the training data did not result in a decrease in neural machine translation (NMT) performance.
We performed an extrinsic evaluation by applying misalignment detection (MAD) on two web-scraped corpora and examined the effect of removing misalignments on NMT performance.

Summary

Introduction

A machine translation (MT) system usually increases its performance when more training data is added. However, previous research showed that the performance of a neural machine translation (NMT) system decreases when the training data contains noisy sentence pairs [2,3], as an NMT model tends to assign high probabilities to rare events. Data crawled from the web typically contains a variety of noise: untranslated sentences, language and encoding errors, short segments, and misalignments. The effect of these types of noise on NMT and statistical machine translation (SMT) was systematically investigated by Khayrallah and Koehn [4]. It was also shown that alignment errors can impact SMT performance [5].

Journal: Informatics
Publication Date: Sep 1, 2019
License type: CC BY 4.0

Similar Papers

Baidu Translate: Research and Products
Zhongjun He
01 Jan 2015

Real-Time Automatic Translation Algorithm for Chinese Subtitles in Media Playback Using Knowledge Base
Li Yan
18 Jun 2022

English-Myanmar Supervised and Unsupervised NMT: NICT’s Machine Translation Systems at WAT-2019
Rui Wang ... Eiichiro Sumita
01 Jan 2019

Iterative Training of Unsupervised Neural and Statistical Machine Translation Systems
Benjamin Marie ... Atsushi Fujita
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 19
01 Jun 2020
