Abstract

Bilingual lexicons and parallel phrases benefit a range of natural language processing (NLP) tasks. Recent research has shown that high-quality bilingual lexicons can enhance the performance of machine translation, and incorporating them into some specialized NLP tasks also yields clear gains. Bilingual lexicons and parallel phrases can be easily extracted from parallel corpora, but parallel corpora remain scarce compared with monolingual corpora. Monolingual corpora, however, can also be mined for a large number of parallel word and phrase pairs. In this paper, we propose two strategies to extract parallel words and phrases from monolingual corpora. On the one hand, we present an indirect mining strategy, Anchored Mining (AM), which injects an anchoring point into each mining procedure to improve accuracy. On the other hand, inspired by how humans learn a foreign language, we propose a novel direct algorithm, Bootstrapping Mining (BM), which mimics the human learning process and aims to learn parallel phrases automatically in a self-iterative way. Additionally, we propose a novel metric, phrase probability-sub item average probability (PP-SAP), which quantitatively evaluates the rationality of each parallel phrase pair extracted from the monolingual corpora. We conduct experiments on large-scale English-Chinese, English-Russian, and English-French monolingual corpora, and the results show that our methods can mine high-quality bilingual lexicons and parallel phrases. We also evaluate our algorithms on low-resource monolingual corpora and obtain good results as well.
