Abstract

Statistical methods to extract translational equivalents from non-parallel corpora hold the promise of ensuring the required coverage and domain customisation of lexicons as well as accelerating their compilation and maintenance. A challenge for these methods are rare, less common words and expressions, which often have low corpus frequencies. However, it is rare words such as newly introduced terminology and named entities that present the main interest for practical lexical acquisition. In this article, we study possibilities of improving the extraction of low-frequency equivalents from bilingual comparable corpora. Our work is carried out in the general framework which discovers equivalences between words of different languages using similarities between their occurrence patterns found in respective monolingual corpora. We develop a method that aims to compensate for insufficient amounts of corpus evidence on rare words: prior to measuring cross-language similarities, the method uses same-language corpus data to model co-occurrence vectors of rare words by predicting their unseen co-occurrences and smoothing rare, unreliable ones. Our experimental evaluation demonstrates that the proposed method delivers a consistent and significant improvement on the conventional approach to this task.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.