Irregular Arabic plurals recognition without stemming

Abduelbaset Goweder, Samira Eshafah, Ali Shafah, Ahmed Rgibi

doi:10.1109/ceit.2016.7929052

Abstract

With the growth of digital Arabic documents, especially in information retrieval (IR) and natural language processing (NLP) applications, identifying irregular plurals, commonly called broken plurals (BP), in Modern Standard Arabic has become an urgent issue. Broken plurals are formed by imposing interdigitating patterns on stems, so their singular forms cannot be recovered by standard affix-stripping stemming techniques. Identifying broken plurals is therefore an important and difficult problem. In information retrieval, deriving singulars from plurals is referred to as stemming, which is typically achieved by removing the affixes attached to a given word. To the best of our knowledge, all existing Arabic stemmers are unreliable and still under research. Consequently, this paper proposes an approach that identifies broken plurals without performing stemming on the given word. The well-known decision tree learner J48 from WEKA is applied to build a classifier (model) from a very large Arabic corpus, which was pre-processed and prepared as training data as part of this work. The built classifier is evaluated on an unseen test set. The results suggest that a very promising broken plural recognizer could be designed and implemented for NLP applications.
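To illustrate why affix stripping cannot recover the singular of a broken plural, the following minimal Python sketch may help. It is not the paper's method: the suffix list and the function name `strip_plural_suffix` are assumptions introduced here for illustration only. It contrasts a sound plural, which carries a removable suffix, with a broken plural, whose change is internal to the stem.

```python
# Illustrative sketch only; the suffix list and function name are assumptions,
# not taken from the paper or from any existing Arabic stemmer.
SOUND_PLURAL_SUFFIXES = ("ون", "ين", "ات")


def strip_plural_suffix(word: str) -> str:
    """Remove one sound-plural suffix if present; otherwise return the word unchanged."""
    for suffix in SOUND_PLURAL_SUFFIXES:
        # Require at least three remaining letters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


sound_plural = "معلمون"  # "teachers"; the singular is "معلم"
broken_plural = "كتب"    # "books"; the singular is "كتاب"

# The sound plural loses its suffix and yields the singular.
print(strip_plural_suffix(sound_plural))

# The broken plural has no suffix to strip, so the singular is not recovered.
print(strip_plural_suffix(broken_plural))
```

In the broken plural the singular differs by an internal letter rather than by an attached suffix, which is the situation the abstract describes: no amount of affix removal produces the singular, so recognition has to rely on something other than stemming.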
