Abstract

Topic modeling techniques are extensively employed in Natural Language Processing to infer topics from unstructured text data. Latent Dirichlet Allocation (LDA), a popular topic modeling technique, can automatically identify topics from a large collection of textual documents. LDA-based topic models, however, do not always yield good results on their own. Clustering, one of the most effective unsupervised machine learning methods, is often applied to tasks such as topic modeling and information extraction from unstructured text. In this study, a hybrid clustering-based approach combining Bidirectional Encoder Representations from Transformers (BERT) and LDA is thoroughly investigated on a large Bangla textual dataset: BERT provides contextual embeddings that are combined with the LDA topic distributions. Experiments on this hybrid model demonstrate its effectiveness at clustering similar topics in a novel dataset of Bangla news articles. The results show that clustering with the BERT-LDA model aids in inferring more coherent topics. On our novel dataset, the maximum coherence value is 0.63 for LDA and 0.66 for the BERT-LDA model.
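To make the hybrid concrete, the sketch below shows one common way such a BERT-LDA pipeline is assembled: LDA document-topic vectors are concatenated with sentence-level BERT embeddings, the combined representation is clustered, and topic coherence is computed on the LDA model. This is a minimal illustration, not the authors' code; the sentence-transformer checkpoint, the weighting factor `gamma`, and the cluster count are assumptions for illustration and are not values reported in the abstract.

```python
# Minimal sketch of a BERT-LDA hybrid: concatenate LDA topic vectors with
# contextual sentence embeddings, cluster the result, and report c_v coherence.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def bert_lda_clusters(raw_docs, tokenized_docs, num_topics=10, gamma=15, k=10):
    # --- LDA document-topic distributions ---
    dictionary = Dictionary(tokenized_docs)
    bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics)
    topic_vecs = np.array([
        [p for _, p in lda.get_document_topics(d, minimum_probability=0.0)]
        for d in bow
    ])

    # --- Contextual embeddings; multilingual checkpoint chosen here only as a
    # placeholder that covers Bangla ---
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    bert_vecs = encoder.encode(raw_docs)

    # --- Concatenate (up-weighting the low-dimensional LDA part) and cluster ---
    combined = np.hstack([topic_vecs * gamma, bert_vecs])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(combined)

    # --- Topic coherence (c_v) of the underlying LDA model ---
    coherence = CoherenceModel(model=lda, texts=tokenized_docs,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return labels, coherence
```

The weighting factor applied to the LDA vectors is a design choice: the topic distribution has far fewer dimensions than the BERT embedding, so without scaling the clustering would be dominated by the contextual features alone.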
