Abstract

Collecting a Chinese-Myanmar parallel corpus (CMPC) is a key step in natural language processing (NLP) and in training machine translation engines (MTE) for minority languages of Southeast Asia. Because CMPC resources are scarce, efficient corpus collection methods are well worth studying. Traditional collection methods include manual collection, text recognition of books, and Internet crawlers; of these, the crawler is widely regarded as the most efficient. However, traditional crawler algorithms are easily disrupted by large amounts of spam and advertising, which makes them time-consuming and imprecise. We propose a web crawler mechanism that combines automatic acquisition of a bilingual website list, corpus crawling, and corpus cleaning to obtain a high-quality parallel corpus. First, we build a website graph and follow hyperlinks to recursively access related corpus websites. Next, we use a breadth-first, backlink, and PageRank crawler framework to build a corpus selection model based on threshold-limited crawling, link matching, and page popularity ranking, so that Chinese-Myanmar parallel text can be located accurately. Finally, a corpus cleaning model based on HTML parsing produces a set of standardized token sequences. In tests of the Chinese-Myanmar crawler algorithm developed in this paper, the model exceeds previously published benchmarks. To date, we have obtained 1.1 million Chinese-Myanmar parallel corpus pairs.
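
The selection stage can be illustrated with a minimal sketch. It is not the paper's implementation: it runs on a small in-memory link graph instead of live websites, it assumes the "threshold" is a crawl-depth limit, and every name in it (`LINKS`, `bfs_crawl`, `backlink_counts`, `pagerank`, the page labels) is hypothetical. It shows only how breadth-first traversal, backlink counting, and PageRank might be combined to rank candidate pages.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (stands in for real websites).
LINKS = {
    "seed": ["news", "ads"],
    "news": ["article_zh_my", "seed"],
    "ads": ["spam"],
    "article_zh_my": ["news"],
    "spam": [],
    "far": ["seed"],
}

def bfs_crawl(graph, seed, max_depth):
    """Visit pages breadth-first, never going deeper than max_depth links (assumed threshold)."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth threshold reached, do not expand this page
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order

def backlink_counts(graph, pages):
    """Count, for each crawled page, how many other crawled pages link to it."""
    counts = {p: 0 for p in pages}
    for src in pages:
        for dst in graph.get(src, []):
            if dst in counts and dst != src:
                counts[dst] += 1
    return counts

def pagerank(graph, pages, damping=0.85, iterations=50):
    """Plain power-iteration PageRank restricted to the crawled pages."""
    n = len(pages)
    page_set = set(pages)
    out = {p: [q for q in graph.get(p, []) if q in page_set] for p in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # Pages without outgoing links spread their rank over all pages.
            targets = out[p] if out[p] else pages
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank

pages = bfs_crawl(LINKS, "seed", max_depth=2)
backlinks = backlink_counts(LINKS, pages)
ranks = pagerank(LINKS, pages)
# Pages sorted from most to least "popular" under this toy model.
ranked_pages = sorted(pages, key=ranks.get, reverse=True)
```

In this toy graph the page `far` is never reached because nothing crawled links to it, and `news` ranks highest because both `seed` and `article_zh_my` link to it. A real crawler would additionally need fetching, language detection, and the link-matching step described above.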
