Cluster-discovery of Twitter messages for event detection and trending

Shakira Banu Kaleel

doi:10.32920/ryerson.14663040.v1

Open Access

https://doi.org/10.32920/ryerson.14663040.v1

Publication Date: Oct 20, 2022

License type: cc-by

Abstract

Social media data carries abundant hidden occurrences of real-time events in the world, which raises the demand for efficient event detection and trending systems. The Locality Sensitive Hashing (LSH) technique is capable of processing large-scale datasets. In this thesis, a novel framework is proposed for detecting and trending events from tweet clusters, discovered using LSH, that are present in a Twitter dataset. The experimental results showed that LSH took only 12.99% of the running time required by K-means to find all of the tweet clusters. Key challenges include: 1) constructing a dictionary using incremental TF-IDF in high-dimensional data in order to create tweet feature vectors; 2) leveraging LSH to find truly interesting events; 3) trending the behavior of an event based on time, geo-location and cluster size; and 4) speeding up the cluster-discovery process while retaining cluster quality.

Highlights

Online social media provides an abundance of data on public opinions which can be used to extract the occurrences of real-time events in the world.
This study deals with the problem of creating a tweet feature vector in high-dimensional data by using a static dictionary constructed for each chunk with an incremental Term Frequency - Inverse Document Frequency (TF-IDF) technique.
The state-of-the-art technique is suitable for analysing large-scale social media data because it can process real-time data quickly.
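The per-chunk dictionary idea in the second highlight can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The class name `IncrementalTfidf`, the `max_terms` cut-off, the rule of keeping the most frequent terms, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter


class IncrementalTfidf:
    """Running document-frequency counts, updated one chunk at a time (hypothetical helper)."""

    def __init__(self):
        self.doc_freq = Counter()  # term -> number of tweets containing it
        self.num_docs = 0          # tweets seen so far across all chunks

    def update(self, chunk):
        """Fold a chunk (a list of token lists) into the running statistics."""
        for tokens in chunk:
            self.num_docs += 1
            self.doc_freq.update(set(tokens))  # count each term once per tweet

    def build_dictionary(self, max_terms):
        """Freeze a static term -> index dictionary for the current chunk.

        Assumption: keep the max_terms terms with the highest document
        frequency so far, breaking ties alphabetically.
        """
        top = sorted(self.doc_freq, key=lambda t: (-self.doc_freq[t], t))[:max_terms]
        return {term: i for i, term in enumerate(top)}

    def vectorize(self, tokens, dictionary):
        """Return a dense TF-IDF vector for one tweet over the frozen dictionary."""
        vec = [0.0] * len(dictionary)
        counts = Counter(t for t in tokens if t in dictionary)
        for term, tf in counts.items():
            # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
            idf = math.log((1 + self.num_docs) / (1 + self.doc_freq[term])) + 1.0
            vec[dictionary[term]] = tf * idf
        return vec
```

In this sketch each new chunk calls `update` and then `build_dictionary`, so the vector length stays fixed within a chunk even as the vocabulary grows across chunks.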

Summary

Introduction

Online social media provides an abundance of data on public opinions which can be used to extract the occurrences of real-time events in the world. The LSH technique is employed to find tweet clusters from which events are detected and trended. This work uses Charikar’s approach to compute the K-bit signature of a tweet feature vector; the signature is then used as input to the prefix-tree-based LSH approach proposed by Kamath et al. [25] to discover the tweet clusters from which events are detected and trended. Background information is given on the prefix tree data structure, which replaces the hash table in the LSH approach. We leverage the prefix tree data structure in this thesis to find the nearest neighbour of a given tweet. Twitter is a popular data source for real-time event detection. E2LSH was compared with the K-means algorithm, and it was confirmed that E2LSH boosts retrieval accuracy compared to K-means at the extra cost of response time due to query expansion [51].
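The two building blocks named above, a K-bit random-hyperplane signature in the style of Charikar and a prefix tree used for lookup instead of a hash table, can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's or Kamath et al.'s implementation. The names `make_hyperplanes`, `signature` and `PrefixTree`, the Gaussian hyperplanes, and the rule of returning everything stored under the longest shared prefix are hypothetical choices.

```python
import random


def make_hyperplanes(k, dim, seed=0):
    """Draw k random hyperplane normals of length dim (Gaussian entries, an assumed choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def signature(vec, hyperplanes):
    """K-bit signature: bit i is '1' when vec has a non-negative dot product with hyperplane i."""
    return "".join(
        "1" if sum(h * x for h, x in zip(plane, vec)) >= 0 else "0"
        for plane in hyperplanes
    )


class PrefixTree:
    """Binary prefix tree over signature strings (illustrative stand-in for a hash table)."""

    def __init__(self):
        self.root = {}

    def insert(self, sig, item_id):
        """Walk or create one node per bit, then store the id at the final node."""
        node = self.root
        for bit in sig:
            node = node.setdefault(bit, {})
        node.setdefault("ids", []).append(item_id)

    def nearest(self, sig):
        """Return all ids stored under the longest prefix of sig present in the tree."""
        node = self.root
        for bit in sig:
            if bit not in node:
                break
            node = node[bit]
        return self._collect(node)

    def _collect(self, node):
        """Gather the ids stored at this node and in every subtree below it."""
        ids = list(node.get("ids", []))
        for key, child in node.items():
            if key != "ids":
                ids.extend(self._collect(child))
        return ids
```

Vectors separated by a small angle agree on most signature bits, so they tend to share a long prefix and land in the same subtree. Candidates returned by `nearest` would still need an exact similarity check before being assigned to a cluster.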

Objectives

Methods

Results

Conclusion