SinTM - LDA and RAKE based Topic Modelling for Sinhala Language

Saman Hettiarachchi, R.M.D.R Kumari

doi:10.1109/asiancon51346.2021.9545070

Abstract

Advances in technology have increased the use of textual information worldwide, and the growing volume of unstructured and heterogeneous text data has become hard to manage. Topic modelling is a technique that retrieves abstract topics from a collection of documents, and it is important for discovering hidden and useful information in large unstructured and heterogeneous text data. Sinhala is a native language of Sri Lanka, spoken primarily by the Sinhalese. This paper presents a novel approach called SinTM that analyzes a single Sinhala text document by combining topic modelling and keyword extraction techniques. The results were benchmarked against the well-known topic modelling algorithm Latent Dirichlet Allocation (LDA), and SinTM was evaluated with prominent topic model evaluation metrics: likelihood, R-squared, perplexity, and coherence. We show that SinTM achieves better results for Sinhala than LDA.

Full Text