Dimensionality reduction in the context of dynamic social media data streams

Moritz Heusinger, Christoph Raab, Frank-Michael Schleif

doi:10.1007/s12530-021-09396-z

Open Access

https://doi.org/10.1007/s12530-021-09396-z

Abstract

In recent years, social media has become an important part of everyday life for many people. A big challenge of social media is finding posts that are interesting for the user. Many social networks, such as Twitter, handle this problem with so-called hashtags. A user can label their own Tweet (post) with a hashtag, while other users can search for posts containing a specified hashtag. But what about finding posts that are not labeled by the creator? We provide a way of completing hashtags for unlabeled posts using classification on a novel real-world Twitter data stream. New posts are created every second, so this setting is a natural fit for non-stationary data analysis. Our goal is to show how labels (hashtags) of social media posts can be predicted by stream classifiers. In particular, we employ random projection (RP) as a preprocessing step in calculating streaming models. We also provide a novel real-world data set for streaming analysis called NSDQ, with a comprehensive data description. We show that this dataset is a real challenge for state-of-the-art stream classifiers. While RP has been widely used and evaluated in stationary data analysis scenarios, its use in non-stationary environments is not well analyzed. In this paper, we provide a use case of RP on real-world streaming data, especially on the NSDQ dataset. We discuss why RP can be used in this scenario and how it can handle stream-specific situations like concept drift. We also provide experiments with RP on streaming data, using state-of-the-art stream classifiers like adaptive random forest and concept drift detectors. Additionally, we experimentally evaluate an online principal component analysis (PCA) approach in the same fashion as we do for RP. To obtain higher dimensional synthetic streams, we use random Fourier features (RFF) in an online manner, which allows us to increase the number of dimensions of low dimensional streams.
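The online use of random Fourier features mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `OnlineRFF`, the `transform_one` method, the RBF kernel choice and the `gamma` parameter are all assumptions made for illustration. The feature map is drawn once and then applied to each arriving sample independently, which is what makes it usable on a stream.

```python
import math
import random


class OnlineRFF:
    """Hypothetical random Fourier feature map, applied one sample at a time."""

    def __init__(self, in_dim, out_dim, gamma=1.0, seed=0):
        rng = random.Random(seed)
        # Frequencies drawn from N(0, 2 * gamma) approximate the RBF kernel
        # k(x, y) = exp(-gamma * ||x - y||^2); gamma is an assumed parameter.
        std = math.sqrt(2.0 * gamma)
        self.weights = [[rng.gauss(0.0, std) for _ in range(in_dim)]
                        for _ in range(out_dim)]
        # Random phase offsets, uniform on [0, 2*pi).
        self.offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(out_dim)]
        self.scale = math.sqrt(2.0 / out_dim)

    def transform_one(self, x):
        # Map a single low-dimensional sample to out_dim features.
        return [self.scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.weights, self.offsets)]


# Usage: lift a 3-dimensional stream to 2000 dimensions, sample by sample.
rff = OnlineRFF(in_dim=3, out_dim=2000, gamma=0.5, seed=42)
stream = [[0.1, 0.2, 0.3], [0.0, 0.5, -0.2], [1.0, -1.0, 0.4]]
lifted = [rff.transform_one(x) for x in stream]
```

Because the weights and offsets are fixed after construction, no state has to be updated as the stream evolves; only the downstream classifier needs to adapt.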

Highlights

Textual data are very common and a key communication format in social media
We evaluate the performance of the classifiers in the high dimensional space compared to the low dimensional space obtained by random projection (RP)
We evaluate the classifiers on the streams in the higher dimensional space, before principal component analysis (PCA) has been applied

Summary

Introduction

Textual data are very common and a key communication format in social media. To analyze these data with machine learning models, embedding approaches are used, leading to high dimensional representations (Mikolov et al. 2013). High dimensionality becomes even more challenging in non-stationary environments, where it is a severe problem. Given that the data is intrinsically low dimensional, the dimensionality can be reduced with low error (Kaban 2015). To address this problem, different projection and embedding methods exist. The most common projection and embedding techniques are principal component analysis (PCA), locally linear embedding, ISOMAP, multidimensional scaling and t-distributed stochastic neighbor embedding (t-SNE).
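The idea that dimensionality can be reduced with low error can be illustrated with a Gaussian random projection, the technique named in the abstract. The following is a minimal sketch under stated assumptions, not the authors' code: the class `StreamRandomProjection`, its `transform_one` method and the chosen dimensions (1000 to 200) are hypothetical. The projection matrix is drawn once, independently of the data, and applied to each sample as it arrives.

```python
import math
import random


class StreamRandomProjection:
    """Hypothetical Gaussian random projection for one-sample-at-a-time use."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Entries ~ N(0, 1 / out_dim), so squared norms are preserved in expectation.
        std = 1.0 / math.sqrt(out_dim)
        self.matrix = [[rng.gauss(0.0, std) for _ in range(in_dim)]
                       for _ in range(out_dim)]

    def transform_one(self, x):
        # Project a single high-dimensional sample to out_dim dimensions.
        return [sum(r * xi for r, xi in zip(row, x)) for row in self.matrix]


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Usage: project two synthetic 1000-dimensional samples down to 200 dimensions
# and compare their distance before and after the projection.
data_rng = random.Random(1)
a = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
b = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]

rp = StreamRandomProjection(in_dim=1000, out_dim=200, seed=7)
ratio = euclidean(rp.transform_one(a), rp.transform_one(b)) / euclidean(a, b)
```

The ratio should stay close to 1, meaning pairwise distances are approximately preserved. Since the matrix does not depend on the data, it needs no refitting when the stream's distribution changes.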

Objectives

Methods

Results

Conclusion