Abstract

Automatic extraction of collocations from a corpus is a well-known problem in natural language processing. It is typically carried out with a statistical measure that indicates whether two words occur together more often than by chance. This article describes a fuzzy set theoretic approach for extracting collocations from a text collection. The approach proposes a fuzzy bi-gram index to find the bi-grams in a collection. Longer collocations, i.e., n-grams (n > 2), are then obtained with the same fuzzy bi-gram index by treating the extracted collocations of lower length as individual words. The performance of the proposed methods is found to be quite promising, and it is better than that of the other widely used methods we considered.

Keywords: Collocation extraction, Fuzzy sets, Natural language processing, Corpus statistics, GENIA corpus.

I. INTRODUCTION

A collocation is a set of words that occur together more often than by chance in a corpus. Collocations are of high importance for many applications in natural language processing, most notably machine translation, word sense disambiguation, language generation, and information retrieval. Most methods for collocation extraction are based on verification of typical collocation properties. These properties are formally described by mathematical formulas that determine the degree of association between the components of a collocation. Such formulas are called
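To make the idea of a statistical measure of co-occurrence concrete, the following is a minimal sketch of one standard measure, pointwise mutual information (PMI), computed over adjacent word pairs. It is not the fuzzy bi-gram index proposed in this article, whose definition does not appear in this excerpt. The function name `bigram_pmi` and the toy sentence are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in `tokens`.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    A positive score means the pair occurs together more often than
    it would if the two words were independent.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)       # number of word positions
    n_bi = len(tokens) - 1    # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus, not taken from the article or from GENIA.
tokens = "new york is big and new york is busy but the city is new".split()
scores = bigram_pmi(tokens)

# List candidate bi-grams from strongest to weakest association.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

On a corpus this small the ranking is unreliable, because PMI is known to overrate pairs built from rare words. In practice such scores are usually combined with a minimum frequency threshold.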
