Mining comparable bilingual text corpora for cross-language information integration

Tao Tao, Chengxiang Zhai

doi:10.1145/1081870.1081958

Abstract

Integrating information in multiple natural languages is a challenging task that often requires manually created linguistic resources, such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources, but instead exploits comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora to discover mappings between those words. Such mappings can then be used to discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese. Since our method relies only on naturally available comparable corpora, it is applicable to any language pair for which comparable corpora are available.
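To make the frequency-correlation idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that each word is represented by its count per time slice (for example, per day) over a shared period, and that Pearson correlation is the similarity measure; the abstract does not specify either. The function names `pearson` and `best_mappings` and the toy counts are hypothetical, made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_mappings(src_series, tgt_series):
    """Map each source word to the target word with the most correlated series.

    Both arguments are dicts of word -> list of counts, one count per
    time slice, aligned on the same time slices (an assumption of this sketch).
    """
    mapping = {}
    for src_word, src_counts in src_series.items():
        score, tgt_word = max(
            (pearson(src_counts, tgt_counts), tgt_word)
            for tgt_word, tgt_counts in tgt_series.items()
        )
        mapping[src_word] = (tgt_word, score)
    return mapping


# Toy daily counts over five days (made-up numbers, not data from the paper).
english = {"earthquake": [0, 9, 7, 1, 0], "election": [5, 1, 0, 6, 8]}
chinese = {"地震": [1, 8, 6, 0, 0], "选举": [4, 0, 1, 7, 9]}

mapping = best_mappings(english, chinese)
for word, (match, score) in mapping.items():
    print(f"{word} -> {match} (correlation {score:.2f})")
```

Because Pearson correlation is invariant to rescaling either series, raw counts and relative frequencies give the same ranking here. A practical system would likely need to handle rare words and many-to-many mappings, which this sketch ignores.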

Similar Papers

Combining Different Seed Dictionaries to Extract Lexicon from Comparable Corpus
Ebrahim Ansari, et al.
Indian Journal of Science and Technology | VOL. 7
20 Sep 2014

Incorporating Word Embedding into Cross-Lingual Topic Modeling
Chia-Hsuan Chang ... San-Yih Hwang
01 Jul 2018

Mining named entity transliteration equivalents from comparable corpora
Raghavendra Udupa ... Jagadeesh Jagarlamudi
26 Oct 2008
