Spotting Urdu Stop Words By Zipf's Statistical Approach

Nuzhat Khan, Abdul Samad, Gul Sahar, Muhammad Junaid Khan, Muhammad Paend Bakht

doi:10.1109/macs48846.2019.9024817

Abstract

This paper presents an innovative method for extracting stop words from large Urdu text. Stop words are words that carry little meaning in natural language; they slow down language processing and affect language analysis negatively. In language analysis, stop words are removed first to ensure fast data processing, but no reliable stop word removal method exists for Urdu. In this work, we apply Zipf's law of two-factor dependency with a least-effort approach to spot stop words in an Urdu corpus created specifically for this research. All Urdu text processing and investigation is carried out in Python 3.4. Previous work on stop word removal is also investigated and found to be less helpful. Using the Zipfian approach, 358 of the 500 highest-frequency words are identified as stop words. We observe that by focusing on only 0.01% of a large corpus, almost all stop words can be spotted, so that a stop word list can be created with the least manual effort. Furthermore, we examine statistical patterns in stop words and content words, the ratio of stop words to content words in data samples, and the dependency of stop words and content words on data size. In terms of data size, frequency, and rank, Zipf's law and Heaps' law coexist in Urdu stop words. Stop words tend to follow predictable and measurable patterns that can lead to reliable probabilistic methods for Urdu processing. This deterministic approach provides strong research ground for exploring stop words in Urdu text statistically.
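
The rank-frequency step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `rank_frequency` and `stop_word_candidates`, the whitespace tokenization, and the romanized toy sentence are all assumptions made for illustration. The sketch only ranks words by frequency and returns the top-ranked ones as candidates; deciding which candidates are true stop words (358 of 500 in the paper) would still require review.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) tuples, most frequent word first.

    Hypothetical helper: ties are broken alphabetically so the order is stable.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


def stop_word_candidates(tokens, top_n=500):
    """Return the top_n highest-frequency words as stop word candidates.

    These are candidates only; high frequency alone does not prove that
    a word is a stop word, so the list still needs manual review.
    """
    return [word for _, word, _ in rank_frequency(tokens)[:top_n]]


# Toy romanized Urdu tokens (assumed data, whitespace-tokenized).
sample = ("kitab mez par hai aur qalam mez par hai "
          "aur kaghaz mez par hai").split()

table = rank_frequency(sample)
candidates = stop_word_candidates(sample, top_n=3)

for rank, word, freq in table:
    # Under an ideal Zipfian distribution, rank * frequency stays
    # roughly constant; a toy sample this small will not show that.
    print(rank, word, freq, rank * freq)
```

On a real corpus, `top_n` would be chosen from the corpus size; the value 500 used as the default here mirrors the number of high-frequency words examined in the abstract.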

Full Text