Selection of In-Domain Bilingual Sentence Pairs Based on Topic Information

Bin Li, Jianmin Yao

doi:10.1155/2020/8879570

Abstract

The performance of a machine translation system (MTS) depends on the quality and size of its training data. How to extend the training data for an MTS in a specific domain, and thereby improve translation quality, remains to be explored. This paper proposes a method for selecting in-domain bilingual sentence pairs based on topic information. Using the topic relevance of bilingual sentence pairs to the target domain, subsets of sentence pairs related to the texts to be translated are selected from a large-scale bilingual corpus and used to train a domain-specific translation system, improving translation quality for in-domain texts. In experiments, bilingual sentence pairs are selected with the proposed method and then used to train the MTS; translation performance is greatly enhanced as a result.

Highlights

At present, the performance of a machine translation system (MTS) is determined by the quality and size of the training data. The larger the size and the higher the quality of the training data, the better the translation performance
Limited by the size of monolingual or bilingual resources in a target domain, the method is likely to result in data sparseness; moreover, the topic diversity of in-domain texts is ignored when a translation model or language model is trained on the entire dataset [13, 14]
Using the topic relevance of texts, the bilingual sentence pairs relevant to the target domain are selected, which provides a new way to extend the training data for domain-specific MTSs and addresses the lack of training data in specific fields
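
The idea of selecting sentence pairs by topic relevance can be illustrated with a minimal sketch. This is not the authors' implementation: the word-to-topic table, the cosine scoring, the threshold, and the names `TOPIC_OF_WORD`, `topic_vector`, and `select_pairs` are all assumptions made for illustration. A real system would learn topics from data and would likely score both sides of each pair, whereas this sketch scores only the source side.

```python
import math
from collections import Counter

# Hypothetical word -> topic table (assumption). A real system would learn
# topics from a corpus rather than hard-code them.
TOPIC_OF_WORD = {
    "patient": 0, "dose": 0, "clinical": 0, "treatment": 0,
    "court": 1, "contract": 1, "liability": 1,
    "goal": 2, "match": 2, "league": 2,
}
NUM_TOPICS = 3


def topic_vector(text):
    """Return the normalized topic distribution of the known words in text."""
    counts = Counter(
        TOPIC_OF_WORD[w] for w in text.lower().split() if w in TOPIC_OF_WORD
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * NUM_TOPICS
    return [counts[t] / total for t in range(NUM_TOPICS)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_pairs(pairs, in_domain_text, threshold=0.5):
    """Keep (source, target) pairs whose source side is topically close
    to the in-domain text. The threshold value is an arbitrary assumption."""
    domain_vec = topic_vector(in_domain_text)
    return [
        pair for pair in pairs
        if cosine(topic_vector(pair[0]), domain_vec) >= threshold
    ]


# Toy general-domain corpus; target sides are placeholders.
corpus = [
    ("the patient received a dose", "<target 1>"),
    ("the court voided the contract", "<target 2>"),
    ("hello world", "<target 3>"),
]
selected = select_pairs(corpus, "clinical treatment of the patient")
```

Here only the first pair shares a topic with the in-domain text, so it is the only one kept; the selected subset would then be added to the training data of the domain-specific system.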

Summary

Introduction

The performance of a machine translation system (MTS) is determined by the quality and size of the training data. The larger the size and the higher the quality of the training data, the better the translation performance. When the training corpora and the test texts belong to different domains, a translation system generally performs poorly. It is therefore desirable to extend the training dataset for an MTS in a specific domain to improve translation performance. However, the bilingual parallel sentence pairs acquired with existing methods for mining bilingual resources generally do not carry labels indicating their domains. Thus, automatically mining bilingual sentence pairs relevant to a specific domain from bilingual resources becomes an effective approach to improving the performance of machine translation.

Related Work

Corpora and Arrangements during the Test

Conclusions


Journal: Scientific Programming
Publication Date: Dec 15, 2020
License type: CC BY 4.0

