Abstract

The ongoing growth in the volume of Arabic-language digital documents and other data available online has increased the need for classification methods that can deal with the complex nature of such data. Arabic text classification plays an important role in many modern applications and intersects with other fields, ranging from search engines to the Internet of Things. However, existing approaches are largely insufficient for classifying the huge quantities of Arabic documents with high accuracy; while some work has addressed the classification of Arabic text, most research has focused on English text. The methods proposed for English are not directly suitable for Arabic, as the morphology of the two languages differs substantially, and the rich morphology of Arabic makes preprocessing a particularly challenging task. In this study, three commonly used classification algorithms, namely, the K-nearest neighbor, Naïve Bayes, and decision tree, were implemented for Arabic text in order to assess their effectiveness with and without the use of a light stemmer in the preprocessing phase. In the experiment, a dataset from the Agence France-Presse (AFP) Arabic Newswire 2001 corpus, consisting of four categories and 800 files, was classified using the three classifiers. The results showed that the decision tree with the light stemmer achieved the best classification accuracy, at 93%.
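As a rough illustration of the experimental setup described above (not the authors' code), the comparison of the three classifiers with and without a preprocessing step could be sketched as follows. The TF-IDF representation, the 70/30 split, and the classifier settings are assumptions; the corpus itself and the stemming function are supplied by the caller.

    # Sketch of the comparison described in the abstract: one shared pipeline,
    # run once without and once with a light-stemming preprocessing step.
    # TF-IDF features, the 70/30 split, and classifier parameters are assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    def evaluate(documents, labels, preprocess=None):
        """Train and test KNN, Naive Bayes, and a decision tree on the corpus."""
        if preprocess is not None:
            documents = [preprocess(d) for d in documents]
        X_train, X_test, y_train, y_test = train_test_split(
            documents, labels, test_size=0.3, random_state=0, stratify=labels)
        vectorizer = TfidfVectorizer()
        X_train = vectorizer.fit_transform(X_train)
        X_test = vectorizer.transform(X_test)
        classifiers = [
            ("KNN", KNeighborsClassifier(n_neighbors=5)),
            ("Naive Bayes", MultinomialNB()),
            ("Decision tree", DecisionTreeClassifier(random_state=0)),
        ]
        for name, clf in classifiers:
            clf.fit(X_train, y_train)
            print(name, accuracy_score(y_test, clf.predict(X_test)))

The with/without-stemmer comparison would then amount to calling evaluate(documents, labels) and evaluate(documents, labels, preprocess=light_stem) on the 800-document, four-category corpus.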

Highlights

  • Machine learning (ML) is a branch of Artificial Intelligence (AI) research [1], which aims to develop practically relevant, multipurpose algorithms that learn from limited amounts of data

  • According to [2], the ML classification technique involves pairing instances with their known labels by manually tagging a group of instances. This group of labeled instances is known as the training set. The labeled instances are used by the classifier to generate the model that maps each instance to its label

  • When a stemmer was included in the preprocessing phase, all three classifiers improved their performance, and again, the decision tree (DT) produced the best result with 93%, compared to Naïve Bayes (NB) with 35% and K-nearest neighbor (KNN) with 26.36%. Thus, the use of a stemmer improved the accuracy of all three classifiers (see the stemming sketch below)
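The highlights do not spell out the light stemmer itself; as a minimal sketch of Arabic light stemming, prefix and suffix stripping might look like the following. The affix lists are illustrative assumptions, not the stemmer evaluated in the study.

    # Minimal Arabic light-stemming sketch: strip a few common prefixes and
    # suffixes from each token, keeping at least three characters of the stem.
    # The affix lists are illustrative only, not the stemmer used in the study.
    PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و", "ب", "ل")
    SUFFIXES = ("هما", "ها", "ات", "ون", "ين", "ية", "ه", "ة")

    def light_stem_token(token, min_stem_len=3):
        for prefix in PREFIXES:
            if token.startswith(prefix) and len(token) - len(prefix) >= min_stem_len:
                token = token[len(prefix):]
                break
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
                token = token[:-len(suffix)]
                break
        return token

    def light_stem(text):
        # Apply the token-level stemmer to whitespace-separated words.
        return " ".join(light_stem_token(t) for t in text.split())

Light stemming of this kind removes common inflectional affixes without attempting full root extraction, which is why it combines well with bag-of-words classifiers.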


Summary

Introduction

Machine learning (ML) is a branch of Artificial Intelligence (AI) research [1] that aims to develop practically relevant, multipurpose algorithms that learn from limited amounts of data. The two major forms of ML are supervised and unsupervised learning. We consider the former, which involves learning a mapping from labeled training data to an output of predictions or classes. Classification is the determination of output values, known as classes or labels, from input objects; this mapping is known as a model or classifier. According to [2], the ML classification technique involves pairing instances with their known labels by manually tagging a group of instances. This group of labeled instances is known as the training set. The labeled instances (i.e., the training set) are used by the classifier to generate the model that maps each instance to its label.
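To make the training-set idea concrete, the following toy example (sentences and labels invented purely for illustration) fits a model on manually labeled instances and uses it to map a new instance to its label.

    # Toy supervised classification: manually labeled instances (the training
    # set) are used to fit a model that maps a new instance to a label.
    # The sentences and labels here are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = [
        "match goal team player",      # sports
        "bank market stock price",     # economy
        "election vote parliament",    # politics
        "match score league",          # sports
    ]
    train_labels = ["sports", "economy", "politics", "sports"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)

    model = MultinomialNB().fit(X_train, train_labels)  # the learned mapping
    new_doc = vectorizer.transform(["team wins the league match"])
    print(model.predict(new_doc))  # -> ['sports']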
