The rise of social media has led to a growing online cyber-war waged through hateful and violent comments, speeches, and even slick videos that promote extremism and radicalization. Detecting cyber-extreme content on microblogging sites, specifically Twitter, is a challenging and evolving research area, owing to the short, noisy, context-dependent, and dynamic nature of the content. Tweets related to the Afghanistan war zone were crawled using query words and then carefully labelled into two classes: Extreme (with two sub-classes: pro-Afghanistan government and pro-Taliban) and Neutral. An Exploratory Data Analysis (EDA) using Principal Component Analysis (PCA) was performed on the tweet data, represented by Term Frequency-Inverse Document Frequency (TF-IDF) features, to reduce the high-dimensional data space to a low-dimensional (usually 2-D or 3-D) space. The PCA-based visualization showed clear cluster separation between the two classes (Extreme and Neutral), whereas the separation between the sub-classes of the Extreme class was not clear. The paper also discusses the pros and cons of applying PCA as an EDA technique to textual data, which is usually represented by a high-dimensional feature set. Furthermore, classification algorithms including naïve Bayes, K Nearest Neighbors (KNN), random forest, Support Vector Machine (SVM), and ensemble classification methods (with bagging and boosting) were applied both to the PCA-based reduced features and to the complete feature set (TF-IDF features extracted from n-gram terms in the tweets). The analysis showed that the SVM achieved an average accuracy of 84%, the best among the classification models compared. This is a novel research work on Twitter content analysis using machine learning methods in the context of the Afghanistan war zone.
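The TF-IDF features over n-gram terms mentioned above can be illustrated with a minimal sketch. It assumes the common weighting tf × log(N / df) over word unigrams and bigrams; the abstract does not state which TF-IDF variant, tokenizer, or n-gram range was actually used, so the function names (`ngrams`, `tfidf`), the `n_max` parameter, and the sample texts are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: whitespace tokenization and lowercasing; the study's actual
    preprocessing is not described in the abstract.
    """
    term_lists = [ngrams(doc.lower().split(), n_max) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical placeholder texts, not data from the study.
sample = [
    "peace talks resume today",
    "peace talks collapse",
    "weather is sunny today",
]
weights = tfidf(sample)
```

Terms that occur in fewer documents receive larger weights, which is why such features tend to be sparse and high-dimensional and why a reduction step such as PCA is useful before visualization.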