Abstract

Stop word elimination is an important pre-processing step in Natural Language Processing (NLP) and text mining applications. Stop word removal can improve the performance and quality of classification systems: in a classification task, removing the most common words, which carry little meaning and are largely irrelevant, reduces the number of dimensions of the term space. This does not mean, however, that stop word removal improves the performance of every type of application in NLP, Artificial Intelligence, Text Mining and Machine Translation. In Machine Translation (MT), eliminating stop words leads to a loss of accuracy, because each token has a specific meaning that must be converted into the target language. To date, no stop word list annotated with lexical classes (Part-of-Speech tags) is available for the Gujarati language to improve the performance of MT systems. This paper presents the construction and categorization of a stop word list for Gujarati based on the lexical classes (nouns, verbs, adjectives, adverbs, etc.) of the Part-of-Speech family. We prepared 126 raw text documents written in Gujarati, each containing more than 260 tokens. Tokenization yielded a list of 32840 tokens. From these tokens we created a list of 1125 unique stop words with their lexical classes, through manual inspection and with the help of linguistic experts. The stop word list, and specifically its categorization, is released herewith for use by the research community in NLP applications, particularly MT systems, for the Gujarati language.
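
As a rough illustration of how a stop word list categorized by lexical class might be used, the following Python sketch tags stop words in place (so that an MT pipeline keeps every token) and, separately, removes them (as a classifier might prefer). The three-entry lexicon, the class labels, the punctuation set and all function names are assumptions made for this example; they are not taken from the released 1125-word list or from the authors' tooling.

```python
# Illustrative sketch only: this tiny lexicon is a hypothetical sample,
# not the paper's released list, and the class labels are assumptions.
STOP_WORDS = {
    "અને": "conjunction",  # "and"
    "છે": "verb",          # "is"
    "તે": "pronoun",       # "he / she / it / that"
}

# Punctuation stripped from token edges (assumed set, including the danda).
PUNCTUATION = ".,;:!?\"'()[]{}।"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (raw.strip(PUNCTUATION) for raw in text.split())
    return [tok for tok in tokens if tok]


def tag_stop_words(tokens, lexicon):
    """Pair every token with its lexical class, or None if it is not a stop word.

    No token is dropped, which suits MT, where each token must be translated.
    """
    return [(tok, lexicon.get(tok)) for tok in tokens]


def remove_stop_words(tokens, lexicon):
    """Drop stop words, e.g. to shrink the term space for a classifier."""
    return [tok for tok in tokens if tok not in lexicon]


if __name__ == "__main__":
    sample = "રામ અને શ્યામ શાળામાં છે."
    tokens = tokenize(sample)
    print(tag_stop_words(tokens, STOP_WORDS))
    print(remove_stop_words(tokens, STOP_WORDS))
```

Whitespace splitting is used here instead of a `\w+` regular expression because Gujarati vowel signs are combining marks, which Python's `\w` does not match, so a regex of that form would cut words apart.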
