Abstract

Due to the enormous growth of information technology, digitized text and data are being generated at an immense rate. Identifying the main topics in a vast collection of documents manually is therefore practically impossible. Topic modeling is a statistical framework that infers the latent, underlying topics in text documents, corpora, or electronic archives through a probabilistic approach, and it is a promising field in Natural Language Processing (NLP). Although many researchers have worked in this field, little significant research has been done for Bangla. In this literature review, we follow a systematic approach to review topic modeling studies published from 2003 to 2020. We analyze topic modeling methods from different aspects and identify the research gap between topic modeling in English and in Bangla. From these papers, we identify several topic modeling techniques, such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Support Vector Machine (SVM), and Bi-term Topic Modeling (BTM). Furthermore, this review highlights real-world applications of topic modeling and discusses the evaluation methods used to measure these models' performance. We conclude by outlining the broad scope for future research on topic modeling in Bangla.

Highlights

  • Because of the rapid development of Information Technology (e.g., the Internet, social media, online databases), the amount of data generated has grown exponentially in recent years

  • Although Bangla is one of the most widely spoken languages in the world, there are barely any topic modeling techniques and studies available for it. In this Systematic Literature Review (SLR), we provide a comprehensive view of topic modeling according to the literature and show how algorithms and techniques differ between English and Bangla

  • The basic idea can be described as follows: documents consist of various topics, which are modeled as distributions over a vocabulary (Arora et al., 2013); a small generative sketch of this idea follows this list
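
The snippet below is a minimal, illustrative sketch of that generative view: each topic is a probability distribution over the vocabulary, and each document is drawn from a mixture of topics. The vocabulary, topic proportions, and hyperparameters are assumptions made up for illustration, not values taken from any reviewed study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative vocabulary and two hand-made topics ("sports" and
# "politics"), each a distribution over the whole vocabulary.
vocab = ["cricket", "goal", "team", "vote", "election", "party"]
topics = np.array([
    [0.35, 0.30, 0.25, 0.04, 0.03, 0.03],   # sports-leaning topic
    [0.03, 0.04, 0.08, 0.30, 0.30, 0.25],   # politics-leaning topic
])

def generate_document(n_words=8, alpha=(0.5, 0.5)):
    """Draw a topic mixture for the document, then draw each word
    from the topic it was assigned to."""
    theta = rng.dirichlet(alpha)                          # document-topic mixture
    zs = rng.choice(len(topics), size=n_words, p=theta)   # per-word topic assignment
    words = [rng.choice(vocab, p=topics[z]) for z in zs]  # per-word draw from topic
    return theta, words

theta, words = generate_document()
print("topic mixture:", np.round(theta, 2))
print("document:", " ".join(words))
```

Topic modeling algorithms such as LDA work in the reverse direction: given only the documents, they infer the topic-word distributions and each document's topic mixture.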


Summary

Introduction

Because of the rapid development of Information Technology (e.g., the Internet, social media, online databases), the amount of data generated has grown exponentially in recent years. This vast accumulation of data provides essential support for training machine learning models and easy access to search engine queries. According to a study by DOMO (a cloud-based business service system), roughly 2.5 quintillion bytes of data are produced daily, and 90% of the world's data was created in the last two years alone (as of 2018) (Al Helal and Mouhoub, 2018). It is not feasible for any person to sift useful information from these vast amounts of data manually. A few of the topic modeling methods used in our reviewed papers are described in brief here.
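
As a hedged, minimal sketch of how one such method, LDA, can be fitted in practice, the example below uses the gensim library on a tiny toy corpus. The documents, whitespace tokenization, and parameters are illustrative assumptions, not the setup used in any of the reviewed studies; a real pipeline, especially for Bangla, would add proper tokenization, stop-word removal, and stemming or lemmatization.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus for illustration only (not from any reviewed paper).
docs = [
    "bangladesh won the cricket match by five wickets",
    "the national team trained hard before the series",
    "parliament passed the new budget after a long debate",
    "voters queued at polling centres during the election",
]

# Simple whitespace tokenization; real preprocessing would be richer.
texts = [doc.lower().split() for doc in docs]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic LDA model on the bag-of-words corpus.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=2, passes=20, random_state=0)

# Inspect the top words of each inferred topic.
for topic_id, top_words in lda.print_topics(num_words=4):
    print(topic_id, top_words)
```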

Methods
Results
Discussion
Conclusion
