Topic Modelling in Bangla Language: An LDA Approach to Optimize Topics and News Classification

Malek Mouhoub, Mustakim Al Helal

doi:10.5539/cis.v11n4p77

Abstract

Topic modeling is a powerful technique for unsupervised analysis of large document collections. Topic models have a wide range of applications in text mining, information retrieval and statistical language modeling, including tag recommendation, text categorization, keyword extraction and similarity search. Research on topic modeling is steadily gaining popularity. Various efficient topic modeling techniques are available for English, as it is one of the most widely spoken languages in the world, but not for many other languages. Bangla is the seventh most spoken native language in the world by population, and it needs automation in several areas. This paper deals with finding the core topics of a Bangla news corpus and classifying news with similarity measures. The document models are built using LDA (Latent Dirichlet Allocation) with bigrams.
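To make the idea of an LDA document model over bigrams concrete, the following is a minimal sketch and not the authors' implementation. It assumes documents are already tokenized, forms bigrams by joining adjacent tokens with an underscore, and fits LDA with a small collapsed Gibbs sampler. The function names `add_bigrams` and `fit_lda`, the hyperparameter values, and the toy Bangla documents are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def add_bigrams(tokens):
    """Return the unigrams followed by adjacent-pair bigrams joined by '_'."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    Returns (theta, topic_word): per-document topic distributions and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                      # token counts per topic
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed document-topic distributions; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Toy, hypothetical documents (sports-like and politics-like tokens).
raw_docs = [
    ["ক্রিকেট", "খেলা", "দল"],
    ["দল", "ক্রিকেট", "খেলা"],
    ["নির্বাচন", "ভোট", "সরকার"],
    ["সরকার", "নির্বাচন", "ভোট"],
]
docs = [add_bigrams(d) for d in raw_docs]
theta, topic_word = fit_lda(docs, num_topics=2, iters=50)
for row in theta:
    print([round(p, 3) for p in row])
```

On a real news corpus, a tested library implementation of LDA would normally replace the hand-written sampler; the sketch only shows where bigram tokens enter the document model.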

Highlights

During the last decade, the amount of data generated by people has made history: roughly 2.5 quintillion bytes of data are produced daily according to a DOMO study, and ninety percent of the data in the world has been created in the last two years alone [1].
We have demonstrated how topic modelling can be extended to the Bangla language on a large scale.

Summary

Introduction

The amount of data generated by people has made history. Roughly 2.5 quintillion bytes of data are produced daily according to a DOMO study, and ninety percent of the data in the world has been created in the last two years alone [1]. Although the content is rich enough, research in Bangla is infrequent because of insufficient datasets and unorganized grammar rules, which are the core challenges of working with Bangla. Considering these challenges, we have created our own corpus and proposed the first-ever topic modeling tool for Bangla.

In related work, the objective was to find similar articles for a scientist out of millions of journals, conference papers, etc.; this is one kind of text categorization using LDA. A similarity measure for Wikipedia data was calculated and demonstrated for different articles against a selected article. Another trend-finding work on topic modelling with LDA was done in [9], where the goal was to investigate research development and current trends from a collection of scholarly articles.
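Comparing different articles against a selected article, as described above, can be illustrated with a short sketch. It assumes each article is already represented by a topic distribution from a fitted LDA model and uses cosine similarity as the measure; the text does not say which similarity measure was used, so cosine is an assumption. The article identifiers and topic vectors are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_id, doc_topics):
    """Rank every other article by similarity to the selected article."""
    query = doc_topics[query_id]
    scores = [
        (doc_id, cosine_similarity(query, vec))
        for doc_id, vec in doc_topics.items()
        if doc_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical document-topic distributions over three topics.
doc_topics = {
    "sports_1": [0.90, 0.05, 0.05],
    "sports_2": [0.80, 0.10, 0.10],
    "politics_1": [0.05, 0.90, 0.05],
    "economy_1": [0.10, 0.20, 0.70],
}

ranking = rank_by_similarity("sports_1", doc_topics)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because topic distributions are probability vectors, divergence-based measures such as Jensen-Shannon or Hellinger distance are common alternatives to cosine similarity; the ranking step itself stays the same.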

Data Preprocessing

Experimentation

Similarity Measure

Classifying News

Conclusion and Future Work