Abstract

We describe Vicomtech’s participation in the WMT 2018 Shared Task on parallel corpus filtering. We aimed to evaluate a simple approach to the task, which can efficiently process large volumes of data and can be easily deployed for new datasets in different language pairs and domains. We based our approach on STACC, an efficient and portable method for parallel sentence identification in comparable corpora. To address the specifics of the corpus filtering task, which features significant volumes of noisy data, the core method was expanded with a penalty based on the number of unknown words in sentence pairs. Additionally, we experimented with a complementary data saturation method based on source sentence n-grams, with the goal of demoting parallel sentence pairs that do not contribute significant amounts of as-yet unobserved n-grams. Our approach requires no prior training and is highly efficient on the type of large datasets featured in the corpus filtering task. We achieved competitive results with this simple and portable method, ranking in the top half among competing systems overall.
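
The two additions to the core method described above can be pictured with a short sketch. The Python below is illustrative only: the penalty form (a base similarity score scaled down by the fraction of out-of-vocabulary tokens), the greedy pass over score-ranked pairs that keeps only pairs still contributing unseen source n-grams, and all names and thresholds are assumptions for clarity, not the exact formulation used in the submission (the paper speaks of demoting rather than outright removing saturated pairs).

    # Illustrative sketch only: names, the penalty form and the thresholds
    # are assumptions, not the exact method described in the paper.

    def oov_penalised_score(base_score, src_tokens, tgt_tokens, src_vocab, tgt_vocab):
        """Scale a sentence-pair score down by the fraction of unknown words."""
        oov = sum(1 for t in src_tokens if t not in src_vocab)
        oov += sum(1 for t in tgt_tokens if t not in tgt_vocab)
        total = len(src_tokens) + len(tgt_tokens)
        return 0.0 if total == 0 else base_score * (1.0 - oov / total)

    def ngram_saturation_filter(ranked_pairs, n=3, min_new=1):
        """Greedy pass over score-ranked pairs: keep a pair only while it still
        contributes at least min_new unseen source n-grams."""
        seen, kept = set(), []
        for score, src_tokens, tgt_tokens in ranked_pairs:
            grams = {tuple(src_tokens[i:i + n]) for i in range(len(src_tokens) - n + 1)}
            if len(grams - seen) >= min_new:
                kept.append((score, src_tokens, tgt_tokens))
                seen |= grams
        return kept

A typical use would sort candidate pairs by the penalised score and then apply the saturation pass to the ranked list, so that near-duplicate high-scoring pairs do not crowd out pairs carrying new source material.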

Highlights

  • Data-driven approaches to Machine Translation (MT) have been the dominant paradigm in the last two decades, with the development of Statistical Machine Translation (SMT) (Brown et al., 1990) and, more recently, of Neural Machine Translation (NMT) (Bahdanau et al., 2015).

  • For the lexical translation tables needed by the STACC algorithm, we trained IBM2 models with the FASTALIGN toolkit (Dyer et al., 2013) on corpora made available for the WMT 2018 news translation task; a hedged sketch of how such a table can feed STACC-style scoring follows this list.

  • The results of our approach on the WMT 2018 test sets are shown in Table 1. Overall, our primary submission, STACC.OOV, performed well on the task, ranking in the top third for SMT 10M and NMT 10M, and as a mid-performing system in the other two scenarios.
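
As referenced in the list above, the following sketch shows one way a word-level translation table, extracted from the word alignments trained on the WMT 2018 data, could be loaded and used for a STACC-style overlap score. The plain "source target probability" file format, the probability threshold, and the simplified two-way Jaccard computation (without the longest-common-prefix backoff of the full STACC method) are assumptions for illustration, not the exact pipeline used in the submission.

    # Illustrative sketch only: table format, threshold and scoring are assumptions.
    from collections import defaultdict

    def load_lexical_table(path, min_prob=0.1):
        """Read 'source target probability' lines into source -> {likely translations}."""
        table = defaultdict(set)
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                src, tgt, prob = line.split()
                if float(prob) >= min_prob:
                    table[src].add(tgt)
        return table

    def stacc_like_score(src_tokens, tgt_tokens, s2t, t2s):
        """Average Jaccard overlap between each side and the translation
        expansion of the other side (simplified: no prefix backoff)."""
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        if not src_set or not tgt_set:
            return 0.0
        src_expansion = set().union(*(s2t.get(tok, {tok}) for tok in src_set))
        tgt_expansion = set().union(*(t2s.get(tok, {tok}) for tok in tgt_set))
        jac_st = len(src_expansion & tgt_set) / len(src_expansion | tgt_set)
        jac_ts = len(tgt_expansion & src_set) / len(tgt_expansion | src_set)
        return 0.5 * (jac_st + jac_ts)

In practice, such a base score would then be combined with the unknown-word penalty sketched earlier before ranking candidate sentence pairs.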

Summary

Introduction

Data-driven approaches to Machine Translation (MT) have been the dominant paradigm in the last two decades, with the development of Statistical Machine Translation (SMT) (Brown et al., 1990) and, more recently, of Neural Machine Translation (NMT) (Bahdanau et al., 2015). These approaches require large volumes of parallel sentences to properly model translation in a given language pair. Corpora created via crawling, with automated document and sentence alignment, tend to exhibit significant volumes of noisy data, which can be detrimental to the training of MT systems (Khadivi and Ney, 2005; Khayrallah and Koehn, 2018a). Xu and Koehn (2017) introduced Zipporah, a fast data selection system for noisy parallel corpora, which is shown to result in improved SMT system quality.

