Urdu is used by approximately 200 million people for daily spoken and written communication, and a substantial amount of unstructured Urdu text is available worldwide. Data mining techniques can extract meaningful knowledge from such a large, potentially informative source of data. Many text processing systems exist for unstructured textual data, but most are language specific and have been developed for languages such as English, Spanish, and Chinese. Comparatively few language processing resources are available for Urdu. Stemming is one of the most important preprocessing steps in text mining; its goal is to reduce the grammatical forms of a word (variations in part of speech, gender, tense, and so on) to their root form. In this work, we extend the stemming capabilities of our existing pattern-based comprehensive stemming system for Urdu text. In addition to the stemming rules from our previous work, we introduce novel rules for prefix and infix stemming. We also optimize the existing suffix removal rules and extend the add-character lists used for word normalization. These stemming rules are generic and can generate the stems of Urdu words as well as loan words (words borrowed from other languages, e.g., Arabic, Persian, and Turkish). In the experimental evaluation, we observed a significant improvement in the overall stemming accuracy of our proposed pattern-based Urdu stemmer, which demonstrates that the proposed stemming approach can be adopted in a variety of text-processing applications.