Categorization of Event Clusters from Twitter Using Term Weighting Schemes

Surender Singh Samant, N L Bhanu Murthy, Aruna Malapati

doi:10.31449/inf.v45i3.3063

Abstract

A real-world event is commonly represented on Twitter as a collection of repetitive and noisy text messages posted by different users. Term weighting is a popular pre-processing step for text classification, especially when the size of the dataset is limited. In this paper, we propose a new term weighting scheme and a modification to an existing one and compare them with many state-of-the-art methods using three popular classifiers. We create a labelled Twitter dataset of events for exhaustive cross-validation experiments and use another Twitter event dataset for cross-corpus tests. The proposed schemes are among the best performers in many experiments, with the proposed modification significantly improving the performance of the original scheme. We create two majority voting based classifiers that further enhance the F1-scores of the best individual schemes.

Highlights

Twitter is a popular microblogging platform with millions of active users posting messages every day [18]
There is a limit to the maximum allowed length of a message (e.g. Twitter restricts the length to 280 characters)
As Support Vector Machine (SVM) has given the best scores in this experiment, we report results based on SVM for the remaining 10-fold cross-validation (10-CV) experiments

Summary

Introduction

Twitter is a popular microblogging platform with millions of active users posting (publishing) messages (tweets) every day [18]. We consider an event to be any newsworthy real-world occurrence discussed on Twitter, and therefore use the terms event and news interchangeably. Term weighting schemes have traditionally been among the most popular pre-processing methods for text categorization, and they remain applicable even when the dataset is not very big. This raises a question: can the existing term weighting schemes effectively categorize noisy and repetitive Twitter event clusters? To address it, we propose a new term weighting scheme and a modification to an existing scheme for event categorization.
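To make the idea of term weighting concrete, the following is a minimal sketch of the classic unsupervised tf-idf weighting applied to tokenized event clusters. It is an illustration only and is not the scheme proposed in the paper; the function name `tf_idf` and the toy clusters are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Illustrative tf-idf: raw term frequency times log(N / document frequency).
    This is a generic baseline, not the paper's proposed scheme.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Hypothetical event clusters, each flattened into a list of tokens.
clusters = [
    "earthquake hits city earthquake damage".split(),
    "team wins final match".split(),
    "city council election results".split(),
]
weights = tf_idf(clusters)
```

In this toy input, a term repeated within one cluster and absent from the others ("earthquake") receives a higher weight than a term shared across clusters ("city"), which is the behaviour a classifier relies on when the weighted vectors are used as features.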

Unsupervised methods

Supervised methods

Proposed method

Proposed modification to χ2

Twitter datasets

Normalized event clusters

Sub-datasets

Cross validation experiments using Events1461

Cross validation on sub-datasets

Cross-validation on normalized event clusters

Cross-corpus event classification

General text categorization

Voting-schemes based classifiers
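As a minimal sketch of the general idea of majority voting (not the authors' specific classifiers), assume each term weighting scheme produces one predicted label per event cluster. The function name `majority_vote` and the category labels below are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine aligned label lists (one list per scheme) by majority vote.

    Ties are resolved in favour of the label that appears first among the
    voters, which is how Counter.most_common orders equal counts.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical predictions from three schemes for three event clusters.
preds = [
    ["sports", "politics", "sports"],
    ["sports", "sports", "tech"],
    ["politics", "politics", "tech"],
]
combined = majority_vote(preds)
```

Each cluster receives the label chosen by most schemes, so a single scheme's error can be outvoted by the others.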

Findings

Conclusion and future work


Journal: Informatica
Publication Date: Sep 15, 2021
License type: cc-by

Similar Papers

Improving Term Weighting Schemes for Short Text Classification in Vector Space Model
Surender Singh Samant ... N L Bhanu Murthy
IEEE Access | VOL. 7 | 01 Jan 2019

Term weighting scheme for short-text classification: Twitter corpuses
Issa Alsmadi ... Gan Keng Hoon
Neural Computing and Applications | VOL. 31 | 06 Jan 2018

Location-Aware Model for News Events in Social Media
Mauricio Quezada ... Barbara Poblete
09 Aug 2015

Supervised term weighting for automated text categorization
Franca Debole ... Fabrizio Sebastiani
09 Mar 2003
