The Enhancement of Arabic Stemming by Using Light Stemming and Dictionary-Based Stemming

Yasir Alhanini, Mohd Juzaiddin Ab Aziz

doi:10.4236/jsea.2011.49060

Abstract

Word stemming is one of the most important factors affecting the performance of many natural language processing applications, such as part-of-speech tagging, syntactic parsing, machine translation, and information retrieval. Computational stemming is a pressing problem for Arabic natural language processing because Arabic is a highly inflected language. Existing stemmers do not handle multiword expressions or identify Arabic names. We use an enhanced stemmer that extracts the stem of Arabic words by combining light stemming with dictionary-based stemming. The enhanced stemmer also handles multiword expressions and performs named entity recognition. We evaluated it on an Arabic corpus of ten documents and report the accuracy of the enhanced stemmer, the light stemmer, and the dictionary-based stemmer on each document. The enhanced stemmer achieved an average accuracy of 96.29% on the corpus. The experimental results show that the enhanced stemmer outperforms both the light stemmer and the dictionary-based stemmer, achieving the highest accuracy values.

Highlights

Word stemming is one of the most important factors affecting the performance of many natural language processing applications, such as part-of-speech tagging, syntactic parsing, machine translation, and information retrieval
The accuracy of the enhanced stemmer was higher than that of the light stemmer and the dictionary-based stemmer on every document in the corpus
We have presented an enhanced stemmer for extracting the stem and root of Arabic words

Summary

Introduction

Word stemming is one of the most important factors affecting the performance of many natural language processing applications, such as part-of-speech tagging, syntactic parsing, machine translation, and information retrieval. In Arabic, there are two main approaches to stemming: light stemming and dictionary-based stemming. Light stemming is an affix-removal approach: it strips a small set of prefixes and/or suffixes to find the root of the word. Dictionary-based stemming is a morphological approach that relies on a set of lexicons of Arabic stems, prefixes, and suffixes to extract the stem of a word. Dictionary-based stemming can find the stems of broken (irregular) plural nouns and of irregular verbs, because the stems of these irregular words are stored in the lexicon. Handling multiword expressions avoids needless structural analysis and reduces both stemming ambiguity and stemming time.
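The two approaches described above can be illustrated with a minimal Python sketch. The affix lists, the one-entry lexicon, the minimum-length threshold, and the function names (`light_stem`, `combined_stem`) are assumptions made for illustration only; they are not the lists or the procedure used by the authors, and the order in which the two steps are combined here is likewise a guess.

```python
# Illustrative affix lists (assumed for this sketch, not taken from the paper).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")

# Hypothetical stem lexicon: maps an irregular form to its stored stem.
# Here the broken plural "كتب" (books) maps to the singular "كتاب" (book).
IRREGULAR_STEMS = {"كتب": "كتاب"}


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, longest match first.

    An affix is removed only if at least `min_len` characters remain,
    which guards against over-stripping short words.
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def combined_stem(word):
    """Assumed combination: look the word up in the lexicon, otherwise
    light-stem it and look the stripped form up again."""
    if word in IRREGULAR_STEMS:
        return IRREGULAR_STEMS[word]
    stripped = light_stem(word)
    return IRREGULAR_STEMS.get(stripped, stripped)


if __name__ == "__main__":
    for w in ("الكتاب", "المعلمون", "الكتب"):
        print(w, light_stem(w), combined_stem(w))
```

In this sketch, light stemming alone leaves the broken plural "الكتب" as "كتب" after removing the definite article, while the lexicon lookup maps it to the stored stem "كتاب". This is the kind of irregular form that affix removal by itself cannot resolve.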

Methods

Discussion

Conclusion