Topic Modeling for Amharic User Generated Texts

Girma Neshir, Solomon Atnafu, Andreas Rauber

doi:10.3390/info12100401

Abstract

Topic modeling is a statistical process that derives latent themes from large collections of text. Three approaches to topic modeling exist: unsupervised, semi-supervised, and supervised. In this work, we develop a supervised topic model for an Amharic corpus. We also investigate the effect of stemming on topic detection using Term Frequency Inverse Document Frequency (TF-IDF) features, Latent Dirichlet Allocation (LDA) features, and a combination of these two feature sets, with four supervised machine learning methods: Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR), and Neural Nets (NN). We evaluate our approach on an Amharic corpus of 14,751 documents in ten topic categories. Both qualitative and quantitative analyses of the results show that the proposed supervised topic detection performs best, with an accuracy of 88%, when SVM is applied to state-of-the-art TF-IDF word features with the Synthetic Minority Over-sampling Technique (SMOTE) and no stemming. The results show that text features with stemming slightly improve the performance of the topic classifier over features with no stemming.
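To make the TF-IDF word features mentioned above concrete, the following is a minimal sketch in plain Python. The weighting formula (term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency), the function name `tfidf`, and the toy corpus are all assumptions made for illustration; the abstract does not state which TF-IDF variant or implementation the authors used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (not taken from the paper):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus; English placeholders stand in for
# tokenized Amharic posts.
docs = [
    ["sport", "team", "goal"],
    ["sport", "election", "vote"],
    ["election", "vote", "party"],
]
vectors = tfidf(docs)
```

Under this weighting, a term that occurs in only one document (such as `goal`) receives a higher weight than a term shared by several documents (such as `sport`), and a term present in every document receives a weight of zero. Vectors of this kind would then be passed to a classifier such as the SVM named in the abstract.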

Highlights

With the rapid advancement of social media technologies, there is a vast accumulation of user generated content on different topics.
We address the following research questions: (1) Does Latent Dirichlet Allocation (LDA) provide a suitable feature set for discriminating Amharic user generated texts into specific topic categories? (2) Do preprocessing operations such as stemming have a positive effect on the topic modeling of Amharic user generated text? (3) To what extent does supervised topic detection improve topic classification? (4) To what extent are the topic categories accurately predicted by the trained model?
We provide annotated datasets of user generated content for supervised topic modeling in Amharic [15].
We identify the most salient features (TF-IDF, LDA, or their combination) for discriminating topics with machine learning models.
We investigate the effect of stemming on TF-IDF word features for identifying the topics of Amharic texts.
Information 2021, 12, 401

Summary

Introduction

With the rapid advancement of social media technologies, there is a vast accumulation of user generated content on different topics. With the emergence of different online platforms (news posts, social media platforms, and other sources), content is available in various forms: text, audio, video, images, and graphics. Among these, textual content takes up the larger proportion, 80% of the existing content [1]. This content is not limited to well-resourced languages; content in less-resourced languages, such as Amharic, is also increasing quickly.

Methods

Results

Discussion

Conclusion