Lexical classes based stop words categorization for Gujarati language

Rajnish M Rakholia, Jatinderkumar R Saini

doi:10.1109/icaccaf.2016.7749005

Abstract

Stop word elimination is an important pre-processing step in Natural Language Processing (NLP) and text mining applications. Stop word removal can improve the performance and quality of a classification system. In a classification task, the number of dimensions in the term space can be reduced by removing the most common words, which carry little meaning and are irrelevant to the task. However, stop word removal does not improve the performance of every type of application in NLP, Artificial Intelligence, Text Mining, and Machine Translation. In Machine Translation (MT), eliminating stop words leads to a loss of accuracy, because each token carries a specific meaning that must be converted into the target language. To date, no unique stop word list annotated with lexical classes (Part-of-Speech tags) is available for the Gujarati language to improve the performance of MT systems. This paper presents the construction and categorization of a stop word list for Gujarati based on the lexical classes (nouns, verbs, adjectives, adverbs, etc.) of the Part-of-Speech family. We prepared 126 raw text documents written in Gujarati, each containing more than 260 tokens. Tokenization yielded a list of 32,840 tokens. From these tokens, we created a list of 1,125 unique stop words with their lexical classes through manual inspection and with the help of linguistic experts. The stop word list, and specifically its categorization, is released herewith for use by the research community in Gujarati NLP applications, particularly MT systems.
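The distinction the abstract draws between classification (where stop words are dropped) and MT (where they are kept but tagged) can be illustrated with a minimal sketch. This is not the authors' implementation: the four dictionary entries and their class labels are hypothetical examples chosen for illustration and are not taken from the released list, and the whitespace tokenizer and the function names are assumptions.

```python
from collections import Counter

# Hypothetical, illustrative entries only; NOT taken from the paper's released list.
STOP_WORD_CLASSES = {
    "અને": "conjunction",  # "and"
    "છે": "verb",          # "is"
    "આ": "pronoun",        # "this"
    "તે": "pronoun",       # "that" / "it"
}

PUNCTUATION = ".,;:!?\"'()[]{}।"


def tokenize(text):
    """Assumed tokenizer: split on whitespace, strip surrounding punctuation."""
    tokens = []
    for raw in text.split():
        token = raw.strip(PUNCTUATION)
        if token:
            tokens.append(token)
    return tokens


def remove_stop_words(tokens, classes=STOP_WORD_CLASSES):
    """Classification-style use: drop every token found in the stop word list."""
    return [t for t in tokens if t not in classes]


def tag_stop_words(tokens, classes=STOP_WORD_CLASSES):
    """MT-style use: keep every token, attaching a lexical class where known."""
    return [(t, classes.get(t)) for t in tokens]


def class_counts(tokens, classes=STOP_WORD_CLASSES):
    """Count stop word occurrences per lexical class."""
    return Counter(classes[t] for t in tokens if t in classes)


if __name__ == "__main__":
    sample = "આ પુસ્તક સારું છે અને તે નવું છે."
    tokens = tokenize(sample)
    print(remove_stop_words(tokens))
    print(tag_stop_words(tokens))
    print(class_counts(tokens))
```

Keeping the lexical class next to each stop word, rather than storing a flat list, is what lets the same resource serve both uses: a classifier can filter on membership alone, while an MT pipeline can retain the token and pass its class along.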

