Stopword Identification and Removal Techniques on TC and IR applications: A Survey

Dhara J Ladani, Nikita P Desai

doi:10.1109/icaccs48705.2020.9074166

Abstract

The concept of a “stopword” was first introduced by H.P. Luhn in 1958. In Natural Language Processing (NLP), a stopword is a common word that is neither indexed nor searchable in a computer search engine. Examples of stopwords are 'a', 'the', and 'is'. Removing stopwords is a pre-processing step in the majority of NLP applications, including Information Retrieval (IR) and Text Classification (TC). The benefits of removing stopwords include a 35-45% decrease in corpus size and improved efficiency and accuracy of text mining applications, which reduces the time and space complexity of the overall application. In this paper, we discuss the major stopword identification techniques used by researchers over the last few decades for Indian and non-Indian languages. We also present a survey of methods used for stopword list generation, along with their characteristics, and describe the effect of various stopword removal techniques in the TC and IR application domains. A comprehensive list of publicly available resources for static stopwords in various languages is also given for quick reference.

Full Text