A framework of Urdu topic modeling using latent dirichlet allocation (LDA)

Khadija Shakeel, Irsha Tehseen, Mubashir Ali, Ghulam Rasool Tahir

doi:10.1109/ccwc.2018.8301655

Abstract

The text mining research community has devoted considerable attention to the development of text mining tools, techniques, and models. Topic modeling is an area of text mining used in applications such as summarization, search, and semantic analysis. It uncovers the hidden topics in a large collection of documents or text, and it is equally important to related research areas such as Natural Language Processing (NLP), Machine Learning (ML), and statistics. Many topic models have been proposed in the literature for a variety of languages, such as English and Arabic. These models differ in their nature, underlying theory, and implementation strategy, because every language has its own morphological structure, semantics, and syntax. The motivation for this work is that no comparable work is available for extracting topics from Urdu documents. Although standard topic models such as Latent Dirichlet Allocation (LDA) have been proposed, there is still a need for a comprehensive model tailored to Urdu text. In this research, we propose an effective topic model for Urdu that copes with the challenges of Urdu morphological structure. The proposed model is a framework that combines pre-processing techniques, the LDA model, and Gibbs sampling. Because it uses the standard LDA model, we name it Urdu Latent Dirichlet Allocation (ULDA). Experiments are conducted to show the efficacy of the proposed approach compared with competing approaches. The experimental results show that the proposed ULDA model outperforms existing systems. This work is the first of its kind for the Urdu language.
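To make the three stages named in the abstract concrete (pre-processing, LDA, Gibbs sampling), the following is a minimal sketch of a generic collapsed Gibbs sampler for LDA in plain Python. It is not the authors' ULDA implementation. The function names (`preprocess`, `lda_gibbs`, `top_words`), the whitespace tokenizer, the tiny stopword list, the toy documents, and the hyperparameter values (`alpha`, `beta`, `iterations`) are all assumptions chosen for illustration; real Urdu pre-processing would need proper tokenization, normalization, and a full stopword list.

```python
import random


def preprocess(docs, stopwords):
    """Whitespace-tokenize and drop stopwords (a stand-in for real Urdu pre-processing)."""
    return [[tok for tok in doc.split() if tok not in stopwords] for doc in docs]


def lda_gibbs(tokenized_docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the vocabulary and the count matrices."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in tokenized_docs]    # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                             # total tokens assigned to topic k

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(tokenized_docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(tokenized_docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Sample a new topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Toy corpus and stopword list (illustrative assumptions, not data from the paper).
stopwords = {"کا", "کی", "میں", "ہے", "نے"}
docs = [
    "کرکٹ ٹیم نے میچ میں میچ",
    "ٹیم کا میچ کرکٹ",
    "حکومت نے انتخابات میں وزیر",
    "وزیر کی حکومت انتخابات",
]
tokens = preprocess(docs, stopwords)
vocab, doc_topic, topic_word = lda_gibbs(tokens, num_topics=2)
for k, words in enumerate(top_words(vocab, topic_word)):
    print(k, words)
```

The sampler keeps only count matrices and resamples one token's topic at a time, which is the usual way Gibbs sampling is applied to LDA. On a corpus this small the recovered topics are not meaningful evidence of anything; the sketch only shows how the pieces fit together.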

Full Text