Abstract

The problem of domain adaptation in statistical machine translation systems stems from the fundamental assumption that test and training data are drawn independently from the same distribution (topic, domain, genre, style, etc.). In real-life translation tasks, the sparseness of in-domain parallel training data often leads to poor model estimation and, consequently, poor translation quality. Domain adaptation by supplementary data selection addresses this specific issue by selecting relevant parallel training data from out-of-domain or general-domain bi-text to enhance the quality of a poor baseline system. State-of-the-art research in data selection focuses on developing novel similarity measures to improve the relevance of the selected data. In this paper, we approach the problem from a different perspective. In contrast to the conventional approach of using the entire available target-domain data as a reference for supplementary data selection, we restrict the reference set to only those sentences that an automatic Quality Estimation model expects the baseline MT system to translate poorly. Our rationale is to direct help (i.e. supplementary training material) to where it is needed most. The experiments reported in this paper show that (i) this technique provides statistically significant improvements over the unadapted baseline translation and (ii) using significantly smaller amounts of supplementary data, our approach achieves results comparable to state-of-the-art approaches using conventional reference sets.
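The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `threshold` and `top_k` parameters, and the convention that a higher quality-estimation score means a better predicted translation are all assumptions made for illustration, and a plain bag-of-words cosine similarity stands in for whatever similarity measure the authors actually use.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercased whitespace tokens as a term-frequency vector."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_supplementary(target_sents, qe_scores, pool, threshold, top_k):
    """Hypothetical helper: pick supplementary sentence pairs for adaptation.

    target_sents: target-domain source sentences.
    qe_scores:    predicted quality of the baseline translation per sentence
                  (assumption: higher means better).
    pool:         out-of-domain (source, target) sentence pairs.
    """
    # Restrict the reference set to sentences predicted to be poorly translated.
    reference = [s for s, q in zip(target_sents, qe_scores) if q < threshold]
    ref_vec = Counter()
    for sentence in reference:
        ref_vec.update(bag_of_words(sentence))

    # Rank pool pairs by similarity of their source side to the reference set.
    ranked = sorted(
        pool,
        key=lambda pair: cosine(bag_of_words(pair[0]), ref_vec),
        reverse=True,
    )
    return reference, ranked[:top_k]
```

In this sketch, the conventional approach corresponds to setting `threshold` high enough that every target-domain sentence enters the reference set; lowering it narrows the reference to the sentences the baseline is expected to handle worst.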
