Abstract

Topic modeling techniques are extensively employed in Natural Language Processing to infer topics from unstructured text data. Latent Dirichlet Allocation (LDA), a popular topic modeling technique, can automatically identify topics in large collections of textual documents. LDA-based topic models, however, do not always yield good results on their own. Clustering, one of the most effective unsupervised machine learning methods, is often employed in applications such as topic modeling and information extraction from unstructured text. In this study, a hybrid clustering-based approach that combines Bidirectional Encoder Representations from Transformers (BERT) and LDA is thoroughly investigated on a large Bangla textual dataset. BERT provides contextual embeddings that are combined with the LDA topic representations before clustering. Experiments on this hybrid model demonstrate its effectiveness in clustering similar topics from a novel dataset of Bangla news articles. The results show that clustering with the BERT-LDA model yields more coherent topics: the maximum coherence value on our dataset is 0.63 with LDA alone and 0.66 with the BERT-LDA model.
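To make the hybrid pipeline concrete, the following is a minimal sketch of how LDA topic distributions and BERT sentence embeddings can be concatenated and clustered, assuming gensim, sentence-transformers, and scikit-learn. The multilingual model name, the number of topics and clusters, and the toy documents are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a hybrid BERT-LDA clustering pipeline (illustrative only).
import numpy as np
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder pre-tokenized documents; the real input is Bangla news articles.
tokenized_docs = [
    ["economy", "growth", "budget"],
    ["cricket", "match", "victory"],
    ["election", "vote", "parliament"],
]
raw_docs = [" ".join(doc) for doc in tokenized_docs]

# 1) LDA: per-document topic distributions.
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, random_state=42)
lda_vecs = np.array(
    [[p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
     for bow in corpus]
)

# 2) BERT: contextual document embeddings (a multilingual model as a stand-in).
bert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
bert_vecs = bert.encode(raw_docs)

# 3) Concatenate both representations and cluster with K-Means.
hybrid_vecs = np.hstack([lda_vecs, bert_vecs])
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(hybrid_vecs)

# 4) Topic coherence (c_v) for the LDA component.
coherence = CoherenceModel(model=lda, texts=tokenized_docs,
                           dictionary=dictionary, coherence="c_v").get_coherence()
print("cluster labels:", labels, "coherence:", coherence)
```

In practice the LDA and BERT vectors may be weighted or reduced in dimensionality before clustering; the sketch simply illustrates the combination of probabilistic topic features with contextual embeddings described in the abstract.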
