Developing and performance evaluation of a new Arabic heavy/light stemmer

Imad Zeroual, Mohamed Boudchiche, Azzeddine Mazroui, Abdelhak Lakhouaja

doi:10.1145/3090354.3090371

Abstract

Stemming is a central step in handling morphologically rich languages such as Arabic. It is widely used in fields such as Natural Language Processing, Information Retrieval (IR), and Text Mining. The goal of stemming is to reduce inflected or derived words to their base form (root or stem), generally a written word form. Because Arabic relies mainly on roots and patterns to generate words, we developed a new, efficient heavy/light stemmer based on the interaction between roots and patterns; it also draws on rich linguistic resources. The stemmer provides three different outputs: an individual root, a stem, and a stem/root combination. In this paper, we evaluate the performance of the stemmer through various experiments on both Modern Standard Arabic and Classical Arabic. The achieved accuracies are 96.93% on the Quranic corpus Al-Mus'haf and 96.56% on the NEMLAR corpus. For usability testing, we study the effectiveness of the stemmer in IR and Part of Speech (PoS) tagging. The results indicate an improvement of 10.98% in PoS tagging and of 14.12% in search efficiency.

Full Text