Abstract

With the rapid growth in the number of tweets published daily on Twitter, automated classification of tweets has become necessary for a broad range of applications (e.g., information retrieval, topic labeling, sentiment analysis, rumor detection) that aim to understand what these tweets are about and what users are expressing on this social platform. Text classification is the process of assigning one or more predefined categories to a text according to its content. Tweets are short, and short text lacks sufficient contextual information, which is part of the challenge in classifying them. Adding to the challenge is increased ambiguity, since diacritical marks are not explicitly written in most Modern Standard Arabic (MSA) texts. Moreover, Arabic tweets are known to mix MSA with dialectal Arabic. In this paper, we propose a scheme to classify textual tweets in the Arabic language, based on their linguistic characteristics and content, into five different categories. We explore two textual representations: word embeddings using Word2vec, and stemmed text with term frequency-inverse document frequency (tf-idf). We tested three classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF). The hyperparameters of all classifiers were tuned. We collected and manually annotated a dataset of approximately 35,600 Arabic tweets for the experiments. The RF and the SVM with a radial basis function (RBF) kernel were statistically indistinguishable when used with stemming and tf-idf, achieving macro-F1 scores between 98.09% and 98.14%. The GNB with word embeddings performed disappointingly poorly. Our result surpasses the current state-of-the-art score of 92.95%, which was achieved with a deep learning approach, RNN-GRU (recurrent neural network-gated recurrent unit).
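The macro-F1 scores reported above average the per-class F1 values with equal weight, so each of the five categories counts the same regardless of how many tweets it contains. The following is a minimal sketch of that metric in plain Python, not the evaluation code used in the experiments; the function name `macro_f1` and the category labels in the usage example are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores.

    Assumption: the label set is every class that appears in either
    y_true or y_pred, and a class with no true positives, false
    positives, or false negatives contributes an F1 of 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic
        # mean of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["news", "news", "sports", "ads"]
y_pred = ["news", "sports", "sports", "ads"]
score = macro_f1(y_true, y_pred)
```

Because every class is weighted equally, a classifier that does poorly on a rare category is penalized as heavily as one that does poorly on a frequent category, which is why macro-F1 is a common choice when class sizes differ.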
