A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora

Pascale Fung

doi:10.1007/3-540-49478-2_1

Abstract

We present two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, obtained with a new method, Convec. Convec is based on the context information of the word to be translated. We show precision ranging from 30% when only the top candidate is considered to 76% when the top 20 translation candidates are considered. Most of the top-20 candidates are either collocations or words related to the correct translation. Since non-parallel corpora contain many more polysemous words, many-to-many translations, and lexical items that differ between the two languages, we conclude that the output from Convec is reasonable and useful.
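As an illustration of the "arrival distances" mentioned in the abstract, the following minimal Python sketch assumes that an arrival distance is the gap, in token positions, between successive occurrences of a word. The function name `arrival_distances` and the toy sentence are hypothetical and are not taken from the paper. How DKvec compares such sequences across the two languages is not described in the abstract and is not shown here.

```python
def arrival_distances(tokens, word):
    """Return the gaps between successive positions of `word` in `tokens`.

    Assumption: an "arrival distance" is the difference between the token
    positions of two consecutive occurrences of the same word.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    # Pair each position with the next one and take the difference.
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# Hypothetical toy text; a real corpus would be far longer.
tokens = "the bank raised rates and the bank cut staff while the bank grew".split()
print(arrival_distances(tokens, "bank"))  # [5, 5]
```

A word that occurs fewer than twice yields an empty list, so very rare words carry no arrival-distance signal under this assumption.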

Similar Papers


A statistical view on bilingual lexicon extraction
Pascale Fung
01 Jan 1999

Combining Different Seed Dictionaries to Extract Lexicon from Comparable Corpus
Ebrahim Ansari, et al.
Indian Journal of Science and Technology | VOL. 7
20 Sep 2014

Parallel Sentence Extraction Based on Unsupervised Bilingual Lexicon Extraction from Comparable Corpora
Chenhui Chu ... Sadao Kurohashi
Journal of Natural Language Processing | VOL. 22
01 Jan 2015
