An empirical evaluation of text representation schemes to filter the social media stream

Sandip Modha,Prasenjit Majumder,Thomas Mandl

doi:10.1080/0952813x.2021.1907792

Abstract

ABSTRACT Modeling text in a numerical representation is a prime task for any Natural Language Processing downstream task such as text classification. This paper attempts to study the effectiveness of text representation schemes on the text classification task, such as aggressive text detection, a special case of Hate speech from social media. Aggression levels are categorized into three predefined classes, namely: ‘Non-aggressive’ (NAG), ‘Overtly Aggressive’ (OAG), and ‘Covertly Aggressive’ (CAG). Various text representation schemes based on BoW techniques, word embedding, contextual word embedding, sentence embedding on traditional classifiers, and deep neural models are compared on a text classification problem. The weighted score is used as a primary evaluation metric. The results show that text representation using Googles’ universal sentence encoder (USE) performs better than word embedding and BoW techniques on traditional classifiers, such as SVM, while pre-trained word embedding models perform better on classifiers based on the deep neural models on the English dataset. Recent pre-trained transfer learning models like Elmo, ULMFi, and BERT are fine-tuned for the aggression classification task. However, results are not at par with the pre-trained word embedding model. Overall, word embedding using pre-trained fastText vectors produces the best weighted -score than Word2Vec and Glove. On the Hindi dataset, BoW techniques perform better than word embeddings on traditional classifiers such as SVM. In contrast, pre-trained word embedding models perform better on classifiers based on the deep neural nets. Statistical significance tests are employed to ensure the significance of the classification results. Deep neural models are more robust against the bias induced by the training dataset. They perform substantially better than traditional classifiers, such as SVM, logistic regression, and Naive Bayes classifiers on the Twitter test dataset.

Full Text