Abstract

Word stemming is a widely used mechanism in the fields of Natural Language Processing, Information Retrieval, and Language Modeling. Language-independent stemmers discover classes of morphologically related words from the ambient corpus without using any language-specific rules. In this article, we propose a fully unsupervised, language-independent text stemming technique that clusters morphologically related words from the corpus of the language using both lexical and co-occurrence features, namely lexical similarity, suffix knowledge, and co-occurrence similarity. The method applies to a wide range of inflectional languages, as it identifies morphological variants formed through different linguistic processes such as affixation, compounding, and conversion. The proposed approach has been tested in an Information Retrieval application for four languages (English, Marathi, Hungarian, and Bengali) using standard TREC, CLEF, and FIRE test collections. It achieved a significant improvement over word-based retrieval, five other corpus-based stemmers, and rule-based stemmers in all four languages. Besides information retrieval, the proposed approach has also been tested on text classification and inflection removal tasks, where it outperformed the baseline methods in all test scenarios. Thus, we achieved the objective of developing a multipurpose stemming algorithm that can be used not only for information retrieval but also for non-traditional tasks such as text classification, sentiment analysis, and inflection removal.
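To make the general idea of corpus-based stemming concrete, the following minimal Python sketch groups words that are both lexically similar and used in similar contexts. It is an illustration only, not the algorithm evaluated in this article: the shared-prefix similarity measure, the context window, the two thresholds, and all function names are assumptions introduced here, and the suffix-knowledge feature is omitted entirely.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def lexical_similarity(a, b):
    """Shared-prefix length divided by the longer word's length (assumed measure)."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def context_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def cluster_words(sentences, lex_threshold=0.5, cooc_threshold=0.3):
    """Merge word pairs that pass both (hypothetical) thresholds, via union-find."""
    vectors = context_vectors(sentences)
    vocab = sorted(vectors)
    parent = {w: w for w in vocab}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for a, b in combinations(vocab, 2):
        if (
            lexical_similarity(a, b) >= lex_threshold
            and cosine(vectors[a], vectors[b]) >= cooc_threshold
        ):
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for w in vocab:
        clusters[find(w)].add(w)
    return list(clusters.values())


if __name__ == "__main__":
    toy_corpus = [
        ["the", "dog", "walks", "home"],
        ["the", "dog", "walked", "home"],
        ["the", "dog", "walking", "home"],
        ["a", "cat", "sleeps", "here"],
    ]
    for cluster in cluster_words(toy_corpus):
        print(sorted(cluster))
```

Requiring both signals is the design point of the sketch: lexical similarity alone would merge unrelated words that happen to share a prefix, while co-occurrence similarity alone would merge synonyms that are not morphological variants. Each resulting cluster can then be mapped to a single representative form that acts as the stem.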
