An unsupervised annotation of Arabic texts using multi-label topic modeling and genetic algorithm

Huda A Almuzaini,Aqil M Azmi

doi:10.1016/j.eswa.2022.117384

Abstract

Every day the world produces an enormous amount of textual data. This unstructured text is of little use unless it is labeled using a combination of categories, keywords, tags. Humans can never annotate such massive data, and with a growing divide between the daily produced data and those annotated, the only alternative is to mechanize it. Automatic annotation process helps in saving resources in terms of time and cost. The process of multi-label annotation involves associating a document with multiple relevant labels. This paper proposes an unsupervised model to annotate corpus using multi-labels automatically. The model is based on multi-label topic modeling and genetic algorithm (GA). Topic modeling is a technique to extract the hidden topics from text, and the GA is used to find the optimal number of topics. We hyper-tuned the parameters of the topic modeling using two different training methods: variational Bayes and Gibbs sampling. The class imbalance in a corpus can affect the result of topic modeling, where the majority class dominates the minority class. We overcome this problem using the partitioning method. Though the proposed model was developed for the Arabic dataset, it is language neutral. We tested our model on three large Arabic corpora and three large English social media datasets. For the Arabic language, our work being the first work that tackles multi-label annotation, we needed a reference to compare our model. For the Arabic corpus, we compared the result of automatic annotation against humans using crowdsourcing (whose labeling was checked for quality). The analysis of the annotation shows an agreement among models (machine vs. human) of 79.30%. Moreover, for the English dataset, the results are quite competitive.

Full Text