Term weighting scheme for short-text classification: Twitter corpuses

Issa Alsmadi, Gan Keng Hoon

doi:10.1007/s00521-017-3298-8

Abstract

Term weighting is a well-known preprocessing step in text classification that assigns an appropriate weight to each term in every document in order to improve classification performance. Most methods proposed in the literature use traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents. However, they are unsuitable for social network data, where texts are limited in length and sparsity and noise are characteristic. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts on the basis of term strength and term distribution, and assesses its effect in a high-dimensional vector space against the term weighting schemes that serve as baselines in traditional text classification. Two datasets are employed with support vector machine, decision tree, k-nearest neighbor, and logistic regression algorithms. The first, the Sanders dataset, is a benchmark dataset that includes approximately 5000 tweets in four categories. The second, a self-collected dataset, contains roughly 1500 tweets distributed across six classes and was collected using the Twitter API. The evaluation applies tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the supervised approaches vary in performance but predominantly outperform the unsupervised approaches. The proposed approach, SW, achieves better accuracy than the other methods. SW can deal with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, improving performance to an F-measure of 80.83 and 90.64 on the Sanders dataset and the self-collected dataset, respectively.
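
To make the limitation of frequency-based weighting on short texts concrete, the following is a minimal sketch of a conventional unsupervised baseline (raw term frequency multiplied by log inverse document frequency). It is not the SW scheme, whose formula is not given in the abstract. The function name `tfidf_weights`, the whitespace tokenization, and the four example texts are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Illustrative baseline only; tokenization is a plain whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical tweet-like texts, invented for this example.
tweets = [
    "new phone battery is great",
    "phone screen cracked again",
    "great match tonight",
    "battery died during the match",
]

weights = tfidf_weights(tweets)

# Highest within-document term frequency across all example texts.
max_tf = max(Counter(t.lower().split()).most_common(1)[0][1] for t in tweets)
```

In these example texts no term occurs more than once per document, so the term-frequency factor is constant and the weights are determined by document frequency alone. This is the kind of sparsity that the abstract argues makes frequency-centered schemes a poor fit for tweets.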

Full Text