A Comparative Study of Dictionary-based and Machine Learning-based Named Entity Recognition in Pashto

Rafiullah Momand, Shakirullah Waseeb, Ahmad Masood Latif Rai

doi:10.1145/3443279.3443307

Abstract

Information Extraction (IE) is the process of extracting structured information from unstructured text using natural language processing (NLP). One important sub-task of IE is the extraction of names of persons, places, and organizations, called Named Entity Recognition (NER). NER plays an important role in many NLP applications such as Question Answering, Machine Translation, and Text Summarization. It has been widely studied for high-resource languages like English. However, no research has taken place in this regard for Pashto. We hypothesized that, based on the research done on NER for English and other languages, a system can be developed for Pashto. We have developed two NER systems for detecting names of persons, places, and organizations in Pashto text. The first is a dictionary-based NER system that uses three dictionaries containing names of persons, locations, and organizations, respectively. The second is a learning-based approach that uses a Hidden Markov Model (HMM) for the task. We evaluated both systems on a dataset collected from sports news. Our evaluation showed an F-Measure of 82% for the HMM and 60% for the dictionary-based NER. Our findings highlight that the HMM outperforms the dictionary-based NER.
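
As an illustration of the dictionary-based approach described in the abstract, the following is a minimal sketch of token-level lookup against three name dictionaries. It is not the authors' implementation: the function name `tag_tokens`, the label strings, and the Latin-script dictionary entries are all hypothetical placeholders, and the sketch assumes single-token names and pre-tokenized input.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical placeholder entries; a real system would load Pashto name lists.
DICTIONARIES: Dict[str, Set[str]] = {
    "PERSON": {"Rashid"},
    "LOCATION": {"Kabul"},
    "ORGANIZATION": {"ICC"},
}


def tag_tokens(
    tokens: List[str], dictionaries: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Label each token with the first dictionary that contains it, else "O"."""
    tagged = []
    for token in tokens:
        label = "O"  # "O" marks a token that is not part of any named entity
        for entity_type, names in dictionaries.items():
            if token in names:
                label = entity_type
                break
        tagged.append((token, label))
    return tagged
```

Calling `tag_tokens(["Rashid", "played", "in", "Kabul"], DICTIONARIES)` labels only the tokens found in a dictionary. A pure lookup of this kind cannot recognize names missing from the dictionaries or resolve tokens that are ambiguous between entity types, which is one plausible reason a learned sequence model can score higher.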

Full Text