Improving the accuracy of text classification using stemming method, a case of non-formal Indonesian conversation

Rianto Rianto, Eri Prasetyo Wibowo, Paulus Insap Santosa, Achmad Benny Mutiara

doi:10.1186/s40537-021-00413-1

Open Access

https://doi.org/10.1186/s40537-021-00413-1


Abstract

Background: Stemming has long been used in data pre-processing for information retrieval, tracing affixed words back to their root. Existing Indonesian stemming methods have been examined and shown to yield high accuracy. However, few stemming methods exist for non-formal Indonesian text processing. This study introduces a new stemming method to solve problems in pre-processing non-formal Indonesian text data, and it aims to improve the accuracy of text classifier models by strengthening the stemming method. A text classifier model is developed using the Support Vector Machine algorithm, and its accuracy is checked. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods.

Findings: With the proposed stemming method, the text classifier model reaches an accuracy of 0.85, compared with 0.73 for the existing method. These results indicate that the proposed stemming method produces a classifier model with a smaller error rate, so it predicts the class of an object more accurately.

Conclusion: Existing Indonesian stemming methods are still oriented towards formal Indonesian sentences and are therefore limited when applied to non-formal ones. This limitation motivates developing a corpus that normalizes non-formal Indonesian into formal Indonesian, to be used as a better stemming method. Using the corpus as a stemming method improves the accuracy of the classifier model. In the future, the proposed corpus and stemming method can be used for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications in Indonesian.
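The corpus-based normalization described in the abstract can be illustrated with a minimal sketch. It assumes the corpus is a simple lookup table from non-formal tokens to formal forms; the table name `NONFORMAL_TO_FORMAL`, its four entries, and the helper functions `tokenize` and `normalize` are hypothetical and chosen for illustration only. They are not taken from the authors' corpus or implementation.

```python
import re

# Hypothetical mini-corpus mapping non-formal tokens to formal forms.
# These entries are illustrative assumptions, not the authors' corpus.
NONFORMAL_TO_FORMAL = {
    "gak": "tidak",
    "nggak": "tidak",
    "udah": "sudah",
    "gimana": "bagaimana",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(text, corpus=NONFORMAL_TO_FORMAL):
    """Replace each token found in the corpus with its formal form.

    Tokens that are not in the corpus pass through unchanged.
    """
    return [corpus.get(token, token) for token in tokenize(text)]


if __name__ == "__main__":
    print(normalize("Gimana, udah makan? Gak lapar."))
```

In a fuller pipeline, the normalized tokens would presumably be passed on to feature extraction and then to the Support Vector Machine classifier; that part is omitted here because the abstract does not specify it.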

Highlights

As social beings, humans always interact with one another
The Indonesian language is classified into two categories according to how it is used, namely formal and non-formal
Formal Indonesian is used in formal situations, while non-formal Indonesian is widely used in casual situations such as social media conversations [2]

Summary

Introduction

Humans always interact with one another. These interactions are carried out in verbal or non-verbal language. Language is an arbitrary sound symbol system, which is used by members of a community to cooperate, interact, and identify themselves [1]. This definition implies that language has a special character that marks the identity of a country and the domain of a dialog topic. The Indonesian language is classified into two categories according to how it is used, namely formal and non-formal. Few stemming methods exist for non-formal Indonesian text processing. This study introduces a new stemming method to solve problems in pre-processing non-formal Indonesian text data. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods.

Rianto et al. J Big Data (2021) 8:26

Objectives

Methods

Results

Conclusion