Abstract

Topic modeling is a powerful technique for unsupervised analysis of large document collections. Topic models have a wide range of applications in text mining, information retrieval and statistical language modeling, including tag recommendation, text categorization, keyword extraction and similarity search. Research on topic modeling is steadily gaining popularity. Various efficient topic modeling techniques are available for English, as it is one of the most widely spoken languages in the world, but not for other languages. Bangla is the seventh most spoken native language in the world by population, and it needs automation in many respects. This paper deals with finding the core topics of a Bangla news corpus and classifying news with similarity measures. The document models are built using LDA (Latent Dirichlet Allocation) with bigrams.

Highlights

  • During the last decade, the amount of data generated by people has made history: roughly 2.5 quintillion bytes of data are produced daily according to a study by DOMO, and ninety percent of the data in the world has been created in the last two years alone [1]

  • We have demonstrated how topic modeling can be extended to the Bangla language on a large scale


Summary

Introduction

The amount of data generated by people has made history. Roughly 2.5 quintillion bytes of data are produced daily according to a study by DOMO, and ninety percent of the data in the world has been created in the last two years alone [1]. Although the available content is rich enough, research on Bangla is infrequent because of insufficient datasets and unorganized grammar rules, which are the core challenges of working with Bangla. Considering these challenges, we have created our own corpus and proposed the first ever topic modeling tool for Bangla. In earlier work, the objective was to find similar articles for a scientist out of millions of journals, conference papers, etc. This is one kind of text categorization using LDA. A similarity measure for Wikipedia data was calculated and demonstrated for different articles against a selected article. Another trend-finding work on topic modeling with LDA was done in [9], where the goal was to investigate research development and current trends from a collection of scholarly articles.
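To illustrate what comparing different articles against a selected article can look like, the sketch below forms bigrams from a token list and ranks documents by the cosine similarity of their topic-proportion vectors. It is a minimal sketch under stated assumptions, not the paper's implementation: the choice of cosine similarity, the function names `bigrams`, `cosine_similarity` and `rank_by_similarity`, the article names, and the hard-coded topic vectors (standing in for the output of a trained LDA model) are all hypothetical.

```python
import math


def bigrams(tokens):
    """Return adjacent token pairs, each joined by an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Score each name -> vector entry against the query, highest similarity first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Bigram features from an already tokenized text (tokens are placeholders).
pairs = bigrams(["topic", "modeling", "tool"])

# Hypothetical topic proportions; in practice these would come from a trained LDA model.
selected = [0.7, 0.2, 0.1]
others = {
    "article_a": [0.6, 0.3, 0.1],
    "article_b": [0.1, 0.1, 0.8],
}
ranking = rank_by_similarity(selected, others)
```

Here `ranking` lists `article_a` before `article_b`, because its topic mix is closer to that of the selected article. Other measures over topic distributions could be substituted for cosine similarity without changing the ranking code.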

Data Preprocessing
Experimentation
Similarity Measure
Classifying News
Conclusion and Future Work