Abstract

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired through domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for the automatic acquisition of monolingual and parallel text and its exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for two domains (environment and labour legislation), two language pairs (English–French and English–Greek), and both translation directions: into and from English. Machine translation systems trained and tuned on general-domain data typically perform poorly on specific domains; we show that such systems can be adapted successfully by retuning model parameters on small amounts of parallel in-domain data, and improved further by using additional monolingual and parallel training data to adapt the language and translation models. The average observed improvement is a substantial 15.30 absolute BLEU points.
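
As a rough illustration of the language-model adaptation idea mentioned in the abstract, the sketch below linearly interpolates a general-domain and an in-domain unigram model and selects the interpolation weight that minimises perplexity on a small in-domain development set. The toy corpora, development sentences, and weight grid are hypothetical placeholders, not the data or exact method used in the paper.

```python
import math
from collections import Counter

def unigram_model(corpus):
    """Estimate a unigram probability distribution from tokenised sentences."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def perplexity(model, sentences, floor=1e-6):
    """Perplexity of a (possibly interpolated) unigram model on held-out text."""
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        for tok in sent:
            # Floor both unseen words and zero-probability entries.
            log_prob += math.log(max(model.get(tok, 0.0), floor))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Hypothetical toy corpora standing in for the crawled general and in-domain data.
general_corpus = [["the", "court", "ruled", "today"], ["markets", "fell", "sharply"]]
in_domain_corpus = [["emission", "limits", "apply"],
                    ["the", "directive", "covers", "waste"]]
dev_set = [["the", "directive", "sets", "emission", "limits"]]

p_general = unigram_model(general_corpus)
p_domain = unigram_model(in_domain_corpus)
vocabulary = set(p_general) | set(p_domain)

best_lambda, best_ppl = None, float("inf")
for lam in [i / 10 for i in range(11)]:  # grid over interpolation weights
    mixed = {w: lam * p_domain.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
             for w in vocabulary}
    ppl = perplexity(mixed, dev_set)
    if ppl < best_ppl:
        best_lambda, best_ppl = lam, ppl

print(f"selected in-domain weight: {best_lambda:.1f}, dev perplexity: {best_ppl:.2f}")
```

In practice the same idea is applied to higher-order n-gram models built with standard LM toolkits, with the mixture weight estimated on in-domain held-out text rather than a fixed grid.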

Highlights

  • Recent advances in statistical machine translation (SMT) have improved machine translation (MT) quality to such an extent that it can be successfully used in industrial processes (e.g., Flournoy and Duran 2009)

  • In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web

  • From the analysis presented above, we conclude that a phrase-based SMT (PB-SMT) system tuned on data from the same domain as the training data strongly prefers to construct translations consisting of long phrases

Summary

Introduction

Recent advances in statistical machine translation (SMT) have improved machine translation (MT) quality to such an extent that it can be successfully used in industrial processes (e.g., Flournoy and Duran 2009). This mostly happens only in specific domains where ample training data is available (e.g., Wu et al. 2008). Tuning with and for specific domains (while using generic training data) allows the MT system to stitch together translations from smaller fragments, which in this case leads to improved translation quality. Such tuning requires only small development sets, which can be harvested automatically from the web with minimal human intervention; no manual cleaning of the development data is necessary. A minimal sketch of this dev-set tuning idea follows.
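
To make the tuning idea concrete, here is a minimal sketch, on invented data, of selecting a single log-linear weight (the language-model weight relative to a fixed translation-model weight) by re-ranking candidate translations so that a small in-domain development set is scored as well as possible. The candidate lists, feature scores, and crude unigram-F1 metric are assumptions for illustration; real phrase-based systems tune the full feature-weight vector with MERT or similar against BLEU.

```python
# Each dev sentence has candidate translations with two hypothetical feature
# scores (translation model, language model) and a reference translation.
dev_set = [
    {
        "candidates": [
            {"text": "the emission limits apply", "tm": -2.1, "lm": -3.0},
            {"text": "the limits of emission apply", "tm": -1.8, "lm": -4.5},
        ],
        "reference": "the emission limits apply",
    },
    {
        "candidates": [
            {"text": "waste directive covers", "tm": -2.5, "lm": -2.2},
            {"text": "the directive covers waste", "tm": -3.0, "lm": -1.4},
        ],
        "reference": "the directive covers waste",
    },
]

def unigram_f1(hyp, ref):
    """Crude stand-in for BLEU: unigram F1 against the reference."""
    h, r = hyp.split(), ref.split()
    overlap = len(set(h) & set(r))
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(h), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def dev_score(lm_weight):
    """Average dev-set score when candidates are re-ranked with this weight."""
    total = 0.0
    for item in dev_set:
        best = max(item["candidates"],
                   key=lambda c: c["tm"] + lm_weight * c["lm"])
        total += unigram_f1(best["text"], item["reference"])
    return total / len(dev_set)

# Grid search over one weight, as a toy version of dev-set tuning.
best_w = max((w / 10 for w in range(1, 21)), key=dev_score)
print(f"selected language-model weight: {best_w:.1f}, dev score: {dev_score(best_w):.3f}")
```

The grid search over a single weight stands in for the repeated line searches that MERT performs over each feature weight in turn on the n-best lists produced by the decoder.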

Web crawling for textual data
Web crawling for parallel texts
Phrase-based statistical machine translation
Domain adaptation in statistical machine translation
Domain-focused web crawling for monolingual and parallel data
Acquisition of monolingual texts
Acquisition of parallel texts
Extraction of parallel sentences
Manual correction of test sentence pairs
Baseline translation system
System description
General-domain data
Baseline system evaluation
Domain adaptation by parameter tuning
Correction of development data
Analysis of model parameters
Analysis of phrase-length distribution
Other alternatives to parameter optimisation
Analysis of learning curves
Language model adaptation
Translation model adaptation
Complete adaptation and result analysis
Findings
Conclusions
