Abstract

The performance of a machine translation system (MTS) depends on the quality and size of the training data. How to extend the training dataset for the MTS in specific domains with effective methods to enhance the performance of machine translation needs to be explored. A method for selecting in-domain bilingual sentence pairs based on the topic information is proposed. With the aid of the topic relevance of the bilingual sentence pairs to the target domain, subsets of sentence pairs related to the texts to be translated are selected from a large-scale bilingual corpus to train the translation system in specific domains to improve the translation quality for in-domain texts. Through the test, the bilingual sentence pairs are selected by using the proposed method, and further the MTS is trained. In this way, the translation performance is greatly enhanced.

Highlights

  • At present, the performance of a machine translation system (MTS) is determined by the quality and size of the training data. e larger the size and the higher the quality of training data are, the superior the translation performance is

  • Limited by the size of monolingual or bilingual resources in a target domain, the method is likely to result in data sparseness; the topic diversity of in-domain texts is ignored when training a translation model or language model with all dataset [13, 14]

  • With the aid of the topic relevance of texts, the bilingual sentence pairs relevant to the target domain are selected, which provides a new method for extending the training data for specific MTSs and solves the problem incurred by the lack of training data in specific fields

Read more

Summary

Introduction

The performance of a machine translation system (MTS) is determined by the quality and size of the training data. e larger the size and the higher the quality of training data are, the superior the translation performance is. The performance of a machine translation system (MTS) is determined by the quality and size of the training data. E larger the size and the higher the quality of training data are, the superior the translation performance is. When the training corpora and the test texts are subordinated to different domains, a translation system generally presents poor performance. It is expected to extend the training dataset for an MTS in specific domains to enhance the performance of the machine translation. The bilingual parallel sentence pairs acquired by using existing methods for mining bilingual resources generally do not contain corresponding labels indicating domains. Us, determining how to automatically mine bilingual sentence pairs relevant to a specific domain from the bilingual resources becomes an effective approach to improve the performance of machine translation The bilingual parallel sentence pairs acquired by using existing methods for mining bilingual resources generally do not contain corresponding labels indicating domains. us, determining how to automatically mine bilingual sentence pairs relevant to a specific domain from the bilingual resources becomes an effective approach to improve the performance of machine translation

Related Work
Corpora and Arrangements during the Test
Conclusions

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.