Abstract

Every day the world produces an enormous amount of textual data. This unstructured text is of little use unless it is labeled with a combination of categories, keywords, and tags. Humans cannot annotate data at this scale, and as the gap widens between the data produced daily and the data annotated, the only alternative is to automate the process. Automatic annotation saves both time and cost. Multi-label annotation associates a document with multiple relevant labels. This paper proposes an unsupervised model that automatically annotates a corpus with multiple labels. The model is based on multi-label topic modeling and a genetic algorithm (GA). Topic modeling is a technique for extracting the hidden topics from text, and the GA is used to find the optimal number of topics. We tuned the hyperparameters of the topic model under two different training methods: variational Bayes and Gibbs sampling. Class imbalance in a corpus can affect the result of topic modeling, because the majority class dominates the minority class. We overcome this problem with a partitioning method. Although the proposed model was developed for Arabic datasets, it is language neutral. We tested our model on three large Arabic corpora and three large English social media datasets. Because ours is the first work to tackle multi-label annotation for the Arabic language, we needed a reference against which to compare our model. For the Arabic corpora, we compared the automatic annotation against human annotation obtained through crowdsourcing (whose labeling was checked for quality). The analysis shows an agreement between machine and human annotation of 79.30%. For the English datasets, the results are quite competitive.
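
To illustrate how a GA can search for the number of topics, the following is a minimal sketch and not the authors' implementation. The function names, the population size, the midpoint crossover, the mutation step, and the search range are all assumptions made for this example. In particular, `toy_fitness` is a hypothetical stand-in: in practice the fitness would be a topic-quality score obtained by training a topic model with `k` topics.

```python
import random


def ga_search_num_topics(fitness, k_min=2, k_max=50, pop_size=12,
                         generations=20, mutation_rate=0.3, seed=0):
    """Search for the integer k in [k_min, k_max] that maximizes fitness(k).

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    cache = {}

    def score(k):
        # Cache scores, since each evaluation would normally train a model.
        if k not in cache:
            cache[k] = fitness(k)
        return cache[k]

    # Initial population: random candidate topic counts.
    population = [rng.randint(k_min, k_max) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the better half survives unchanged (elitism).
        parents = sorted(population, key=score, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2  # crossover: midpoint of two parents
            if rng.random() < mutation_rate:
                child += rng.randint(-3, 3)  # mutation: small random shift
            children.append(min(k_max, max(k_min, child)))  # keep in range
        population = parents + children
    return max(population, key=score)


def toy_fitness(k):
    # Hypothetical stand-in for a topic-quality score; it peaks at k = 17
    # by construction and has no connection to any real corpus.
    return -abs(k - 17)


best_k = ga_search_num_topics(toy_fitness)
```

Because the better half of each generation is carried over unchanged, the best score found never decreases from one generation to the next. A fixed seed makes the search reproducible.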
