A Comparison of Topic Modelling Approaches for Urdu Text

Siraj Munir, Syed Imran Jami, Shaukat Wasi

doi:10.17485/ijst/2019/v12i45/145722


Open Access

https://doi.org/10.17485/ijst/2019/v12i45/145722


Abstract

Objectives: Machine learning based approaches to topic modelling are successful in extracting logical and semantic topics from a given collection of text. We experimented with topic modelling approaches on Urdu poetry text to test whether these approaches perform equally well in any genre of text. Methods: Latent Dirichlet Allocation (LDA), Hierarchical Dirichlet Process (HDP), and Latent Semantic Indexing (LSI) were applied to three different datasets: (i) the CORPUS dataset for news, (ii) the poetry collection of Dr. Allama Iqbal, and (iii) a poetry collection of miscellaneous poets. Each poetry corpus includes more than five hundred poems, approximately equivalent to 1,200 documents. Findings: Before forwarding the raw text to the aforementioned models, we performed feature engineering comprising (i) tokenization and removal of special characters (if any), (ii) removal of stop words, (iii) lemmatization, and (iv) stemming. To compare the approaches on our test samples, we used the coherence and dominance model. Applications: Our experiment shows that LDA and LSI performed well on the CORPUS dataset, but none of the approaches performed well on poetry text. This leads us to conclude that we need to devise sequence-based models that allow users to define weights for poetry-specific text. This work opens a new direction for the domain of text generation and processing.

Keywords: LDA, LSI, HDP, Urdu Poetry Processing, Urdu Poetry Collection, Topic Modelling.
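The feature-engineering steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list is a tiny hypothetical sample, `normalize_token` is a placeholder standing in for lemmatization and stemming (which need Urdu-specific resources), and the regex tokenizer assumes text without diacritics or zero-width joiners.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would need a
# curated Urdu stop-word list.
URDU_STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "اور", "یہ", "سے"}


def normalize_token(token):
    """Placeholder for lemmatization and stemming (identity in this sketch)."""
    return token


def preprocess(text):
    """Tokenize, drop special characters and stop words, then normalize."""
    # \w+ keeps runs of word characters (letters, digits, underscore), so
    # punctuation and other special characters are discarded.
    tokens = re.findall(r"\w+", text)
    kept = [t for t in tokens if t not in URDU_STOP_WORDS]
    return [normalize_token(t) for t in kept]


def bag_of_words(tokens):
    """Count token occurrences; this is the input most topic models expect."""
    return Counter(tokens)


sample = "کراچی میں کرکٹ کا میچ ہے۔"
tokens = preprocess(sample)
print(tokens)
print(bag_of_words(tokens))
```

Keeping normalization behind a single function means a real lemmatizer or stemmer could be swapped in later without touching the rest of the pipeline.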

Highlights

Machine learning is an approach to training machines to perform a specific task efficiently
Natural Language Processing (NLP) is a broad branch of computer science that deals with understanding and processing human languages
Topic modelling is an area of Natural Language Understanding (NLU) that uses statistical modelling to discover keywords which can represent all or part of a document, by applying a dimension-reduction technique to text data.[1]

Summary

Introduction

Machine learning is an approach to training machines to perform a specific task efficiently. Topic modelling is an area of Natural Language Understanding (NLU) that uses statistical modelling to discover keywords which can represent all or part of a document, by applying a dimension-reduction technique to text data.[1] In this study, we compare three topic models, namely Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), and Hierarchical Dirichlet Process (HDP), on Urdu news and Urdu poetry corpora. LSI is a topic modelling approach that uses a low-rank approximation obtained through Singular Value Decomposition (SVD). LSI combines a term-document matrix with SVD and an occurrence matrix for its complete processing cycle. The occurrence matrix is the same as a term-frequency matrix, but here it is sparse in nature.
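The LSI idea described above, a low-rank approximation of the term-document matrix through SVD, can be illustrated with a small sketch. It assumes NumPy is available, uses a toy corpus of transliterated placeholder tokens rather than the paper's data, and stores the occurrence matrix densely for brevity even though it would be sparse in practice. The function names are illustrative only.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a (terms x documents) count matrix from tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for t in doc:
            A[index[t], j] += 1.0
    return vocab, A


def lsi(A, k):
    """Rank-k truncated SVD: keep the k largest singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Each row is one document expressed in the k-dimensional topic space.
    doc_coords = (s_k[:, None] * Vt_k).T
    return U_k, s_k, doc_coords


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy corpus (placeholder tokens, not taken from the paper's datasets).
docs = [
    ["cricket", "match", "team"],
    ["team", "match", "score"],
    ["election", "vote", "party"],
    ["party", "vote", "government"],
]
vocab, A = term_document_matrix(docs)
U_k, s_k, doc_coords = lsi(A, k=2)

print(cosine(doc_coords[0], doc_coords[1]))  # two sports documents
print(cosine(doc_coords[0], doc_coords[2]))  # sports vs. politics
```

In the reduced space, documents that share vocabulary end up close together, which is what lets LSI group them under a common topic.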

Objectives

Methods

Results

Conclusion