Abstract
The stemming process is crucial and significant in the pre-processing step of natural language processing. The stemmer oversees the stemming process. It facilitates the extraction of morphological variants of a root or base word from the provided word. Over the period, several stemmers for various vernacular languages have been proposed. However, very few research studies have comprehensively investigated these available stemmers. This article makes multifold contributions. First, we discuss the various stemmers of 15 Indian and 17 non-Indian languages describing their key points, benefits, and drawbacks. All the Indian languages for which stemmers have been built are covered in this study. For the non-Indian languages, stemmers of commonly spoken languages have been covered. Second, we present a language-wise comparative analysis of stemmers based on our identified parameters. Third, we discuss the wordnets and dictionaries available for different languages. Fourth, we provide details of the datasets available for various languages. Fifth, we also provide challenges in existing stemmers and future directions for future researchers. The study presented in this article reveals that significant research has been carried out for the stemmers of influential languages such as English, Arabic, and Urdu. On the other hand, languages with d resources, such as Farsi, Polish, Odia, Amharic, and others, have received the least attention for research. Moreover, rigorous analysis reveals that most of the stemmers suffer from over-stemming errors. With a complete catalogue of available stemmers, this study aims at assisting the researchers and professionals working in the areas such as information retrieval, semantic annotation, word meaning disambiguation, and ontology learning.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: ACM Transactions on Asian and Low-Resource Language Information Processing
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.