Crawling Chinese-Myanmar Parallel Corpus: Automatic Collection, Screening and Cleaning Corpus

Kai Xiong, Yanmei Jing, Huafu Li, Wenxue He, Rui Yuan, Yansheng Wang, Qiqi He

doi:10.1088/1757-899x/646/1/012046


Open Access

https://doi.org/10.1088/1757-899x/646/1/012046


Abstract

The collection of a Chinese-Myanmar Parallel Corpus (CMPC) is a key step in natural language processing (NLP) and in training Machine Translation Engines (MTE) for the minority languages of Southeast Asia. Because CMPC resources are scarce, efficient corpus collection methods are well worth studying. Traditional corpus collection methods include manual collection, text recognition of books, and Internet crawlers. Among them, the Internet crawler is widely regarded as the most efficient. However, traditional Internet crawler algorithms are easily disrupted by large amounts of spam and advertising, which makes them time-consuming and imprecise. We propose a web crawler mechanism that combines automatic acquisition of a bilingual website list, corpus crawling, and corpus cleaning to obtain a high-quality parallel corpus. First, hyperlinks are followed recursively to reach related corpus websites, building a website graph. Next, breadth-first, backlink, and PageRank crawler frameworks are used to build a corpus selection model based on crawling with a threshold, link matching, and ranking pages by popularity, so that the CMPC can be located accurately. Finally, a corpus cleaning model based on HTML parsing determines a set of standardized token sequences. Tests of the Chinese-Myanmar crawler algorithm established in this paper show that the model exceeds previously published benchmarks. To date, we have obtained 1.1 million Chinese-Myanmar parallel corpus pairs.
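
The two generic building blocks named in the abstract, breadth-first crawling with a threshold and PageRank-style page ranking, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy link graph, the page names, the use of a depth limit as the threshold, and the damping factor of 0.85 are all assumptions made for illustration, and no network access is performed.

```python
from collections import deque


def bfs_crawl(graph, seed, max_depth):
    """Breadth-first traversal of a link graph, stopping at max_depth hops."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth threshold reached: do not expand this page
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over the same link graph."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy link graph (page -> outgoing links), not real data.
TOY_GRAPH = {
    "seed": ["news", "ads"],
    "news": ["article_zh", "article_my"],
    "ads": ["news"],
    "article_zh": ["article_my"],
    "article_my": ["article_zh"],
}

reachable = bfs_crawl(TOY_GRAPH, "seed", max_depth=1)
scores = pagerank(TOY_GRAPH)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the two mutually linked article pages collect the most rank, while the seed page, which nothing links to, collects the least. A selection model could use such a ranking to decide which pages to fetch first.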

Full Text