Abstract

A real-world event is commonly represented on Twitter as a collection of repetitive and noisy text messages posted by different users. Term weighting is a popular pre-processing step for text classification, especially when the size of the dataset is limited. In this paper, we propose a new term weighting scheme and a modification to an existing one and compare them with many state-of-the-art methods using three popular classifiers. We create a labelled Twitter dataset of events for exhaustive cross-validation experiments and use another Twitter event dataset for cross-corpus tests. The proposed schemes are among the best performers in many experiments, with the proposed modification significantly improving the performance of the original scheme. We create two majority voting based classifiers that further enhance the F1-scores of the best individual schemes.

Highlights

  • Twitter is a popular microblogging platform with millions of active users posting messages every day [18]

  • Messages have a maximum allowed length (e.g. Twitter restricts it to 280 characters)

  • Because the Support Vector Machine (SVM) gave the best scores in this experiment, we report SVM-based results for the remaining 10-fold cross-validation (10-CV) experiments


Summary

Introduction

Twitter is a popular microblogging platform with millions of active users posting (publishing) messages (tweets) every day [18]. We consider an event to be any newsworthy real-world occurrence discussed on Twitter, and therefore use the terms event and news interchangeably. Term weighting schemes have traditionally been among the most popular pre-processing methods for text categorization, and they remain applicable even when the dataset is small. This raises the question of whether the existing term weighting schemes can categorize noisy and repetitive Twitter event clusters effectively. To address it, we propose a new term weighting scheme and a modification to an existing scheme for event categorization.
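
To make the idea of term weighting concrete, the following is a minimal sketch of plain tf-idf, a standard unsupervised scheme. It is not the scheme proposed in this paper, and the function name `tfidf_weights` and the toy tweets are assumptions invented purely for illustration.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Return one {term: weight} dict per document using plain tf-idf.

    Illustrative only: whitespace tokenization, raw term counts, and
    idf = log(N / df). This is not the paper's proposed scheme.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

# Hypothetical toy "tweets", made up for this example.
tweets = [
    "earthquake hits city",
    "earthquake damage reported",
    "team wins final match",
]
weights = tfidf_weights(tweets)
```

In this toy corpus, a term that occurs in only one tweet (such as "hits") receives a larger weight than a term shared by several tweets (such as "earthquake"), which is the basic intuition that more elaborate weighting schemes refine.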

Unsupervised methods
Supervised methods
Proposed method
Proposed modification to χ²
Twitter datasets
Normalized event clusters
Sub-datasets
Cross-validation experiments using Events1461
Cross-validation on sub-datasets
Cross-validation on normalized event clusters
Cross-corpus event classification
General text categorization
Voting-schemes based classifiers
Findings
Conclusion and future work