Automatic Stopword Identification Technique for Gujarati Text

Dhara J Ladani, Nikita P Desai

doi:10.1109/aimv53313.2021.9670968

Abstract

Natural Language Processing (NLP) is an Artificial Intelligence (AI) mechanism that allows computers to analyze, comprehend, and derive meaning from human language. In natural language text processing, common words like ‘a’, ‘the’, ‘is’, ‘an’, etc. are known as stopwords. They are typically considered to have no informative value. It has been shown that one of the major benefits of removing stopwords in text-based NLP is a reduction of the text in the corpus by 35–45%, without compromising the performance of the target application. Many stopword lists exist for non-Indian languages like English, Arabic, French and German. Substantial lists are also available for a few Indian languages like Hindi, Sanskrit and Tamil. But to date very little research work has been reported for Gujarati, one of the widely used Indian languages. As per our survey, two major approaches have been suggested for stopword identification in the Gujarati language. The first approach provides a static, generic stopword list, and the second is a rule-based approach. The major drawback of these methods is their inability to handle neologisms. In this paper, we suggest a domain-specific, robust and dynamic stopword list identification mechanism developed for documents written in the Gujarati language. In our proposed approach, we take the top "N" words as seed words based on their frequency and then add "M" other words with similar contexts, which are identified using word embeddings. The effectiveness of removing these (N+M) listed stopwords was then checked by applying stopword removal as a preprocessing phase in Text Classification (TC) and Information Retrieval (IR) applications. In the TC model, the feature vector was reduced by approximately 16%, while the accuracy of the model increased by nearly 3%. The experiments also found that removing these stopwords in the IR application increased the Mean Average Precision (MAP) of the system by nearly 31%. Thus, the overall time and space requirements were decreased without compromising the end results of the system.
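The two-stage idea described in the abstract (N frequency-based seed words, then M words with similar contexts) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the embedding model, so this sketch substitutes simple co-occurrence count vectors with cosine similarity for trained word embeddings. The function names, the `window` parameter, and the toy corpus are all hypothetical, and documents are assumed to be already tokenized.

```python
from collections import Counter, defaultdict
import math


def top_n_seed_words(docs, n):
    """Return the n most frequent tokens across all tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [word for word, _ in counts.most_common(n)]


def cooccurrence_vectors(docs, window=2):
    """Build a sparse context vector per token from neighbours within `window`.

    Assumption: this stands in for the word embeddings mentioned in the
    abstract, which are not specified there.
    """
    vectors = defaultdict(Counter)
    for doc in docs:
        for i, word in enumerate(doc):
            context = vectors[word]  # ensures every token has an entry
            start = max(0, i - window)
            end = min(len(doc), i + window + 1)
            for j in range(start, end):
                if j != i:
                    context[doc[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_stopwords(docs, n, m, window=2):
    """Pick n seed words by frequency, then m non-seed words whose context
    vectors are most similar to any seed word."""
    seeds = top_n_seed_words(docs, n)
    seed_set = set(seeds)
    vectors = cooccurrence_vectors(docs, window)
    scores = {
        word: max((cosine(vec, vectors[s]) for s in seeds), default=0.0)
        for word, vec in vectors.items()
        if word not in seed_set
    }
    # Sort by descending similarity; break ties alphabetically for determinism.
    extra = sorted(scores, key=lambda w: (-scores[w], w))[:m]
    return seeds, extra


if __name__ == "__main__":
    # Toy placeholder corpus; a real run would use tokenized Gujarati documents.
    toy_docs = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran"],
        ["the", "dog", "ran"],
        ["the", "bird"],
    ]
    seeds, extra = expand_stopwords(toy_docs, n=1, m=2)
    print("seed words:", seeds)
    print("context-similar words:", extra)
```

In this sketch the final stopword list would be `seeds + extra`, giving the (N+M) words that are removed during preprocessing. Swapping `cooccurrence_vectors` for vectors from a trained embedding model would bring it closer to the approach the abstract describes.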

Full Text