Abstract

In an era of ubiquitous large-scale evolving data streams, data stream clustering (DSC) has received considerable attention because the scale of the data streams far exceeds the capacity of expert human analysts. It has been observed that high-dimensional data are usually distributed in a union of low-dimensional subspaces. In this article, we propose a novel sparse representation-based DSC algorithm, called evolutionary dynamic sparse subspace clustering (EDSSC). It can cope with the time-varying nature of the subspaces underlying evolving data streams, such as subspace emergence, disappearance, and recurrence. The proposed EDSSC consists of two phases: 1) static learning and 2) online clustering. In the first phase, a data structure for storing the statistical summary of data streams, called the EDSSC summary, is proposed, which better addresses the dilemma between two conflicting goals: 1) saving more points for the accuracy of subspace clustering (SC) and 2) discarding more points for the efficiency of DSC. By further proposing an algorithm to estimate the number of subspaces, the proposed EDSSC does not need to know this number in advance. In the second phase, a more suitable index, called the average sparsity concentration index (ASCI), is proposed, which dramatically improves the clustering accuracy compared to the conventionally used sparsity concentration index (SCI). In addition, a subspace evolution detection model based on the Page-Hinkley test is proposed, with which appearing, disappearing, and recurring subspaces can be detected and adapted to. Extensive experiments on real-world data streams show that EDSSC outperforms the state-of-the-art online SC approaches.
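As a rough illustration of the Page-Hinkley test mentioned in the abstract, the following is a minimal sketch of the textbook form of the test for detecting an upward shift in the mean of a scalar stream. It is not the EDSSC detection model itself: the abstract does not say which statistic is monitored, so the input stream, the class name `PageHinkley`, the helper `first_alarm`, and the parameter values `delta` and `threshold` are all assumptions made for illustration.

```python
class PageHinkley:
    """Textbook Page-Hinkley test for an increase in the mean of a stream.

    `delta` (tolerated drift) and `threshold` (alarm level) are illustrative
    values, not parameters taken from the paper.
    """

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta
        self.threshold = threshold
        self.n = 0           # number of samples seen so far
        self.mean = 0.0      # running mean of the samples
        self.cum = 0.0       # cumulative deviation from the running mean
        self.cum_min = 0.0   # smallest cumulative deviation seen so far

    def update(self, x):
        """Consume one sample; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold


def first_alarm(stream, **kwargs):
    """Return the 0-based index of the first alarm, or None if none occurs."""
    detector = PageHinkley(**kwargs)
    for index, value in enumerate(stream):
        if detector.update(value):
            return index
    return None


# Hypothetical monitoring statistic: stable at 0, then jumping to 1.
stream = [0.0] * 50 + [1.0] * 50
alarm_index = first_alarm(stream)
```

In this toy stream the alarm can only fire after the jump at index 50, because the test statistic stays at zero while the stream is constant. A larger `threshold` delays detection but lowers the false-alarm rate.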

Highlights

  • High-dimensional data streams are generated at an unprecedented scale in various realms, such as media, communication, finance, meteorology, etc. [1]–[4]

  • On the ExYaleB data stream, EDSSC achieves 75.01% accuracy and 86.47% normalized mutual information (NMI), compared with 57.14% accuracy and 74.43% NMI for OLRSC, which has the best performance among all baseline algorithms

  • The goal of this article is to perform data stream clustering (DSC) on evolving high-dimensional data streams, that is, to provide a time-varying subspace clustering (SC) result S_t at each timestamp t, which reflects the partition of the received points X_t such that points belonging to the same subspace are assigned to the same cluster


Summary

Introduction

High-dimensional data streams are generated at an unprecedented scale in various realms, such as media, communication, finance, meteorology, etc. [1]–[4]. These data streams are often high dimensional, unlabeled, large scale, and evolving, which presents huge challenges for data stream clustering (DSC). Representation-based SC (RBSC) approaches have been dominating the field and represent the state of the art. They are based on the hypothesis that each data point in a union of subspaces can be represented as a linear combination of other points, that is, the so-called self-expressiveness property. Popular RBSC approaches include sparse SC (SSC) [1], low-rank representation (LRR) [16], and their variants.
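The self-expressiveness property can be illustrated with a small synthetic sketch. The example below draws points from two subspaces and expresses one point as a sparse linear combination of the others. Several choices here are assumptions for illustration and are not taken from the article: the subspaces are made exactly orthogonal so that the outcome is easy to verify, the function name `omp_self_expression` is hypothetical, and the sparse code is found with greedy orthogonal matching pursuit (OMP), swapped in for the l1 minimization used by SSC.

```python
import numpy as np


def omp_self_expression(X, i, k):
    """Greedily express column i of X using at most k other columns (OMP)."""
    x = X[:, i]
    atoms = X / np.linalg.norm(X, axis=0)    # unit-norm copies of the columns
    residual = x.copy()
    support = []                             # indices of the chosen columns
    coef = np.zeros(0)
    for _ in range(k):
        corr = np.abs(atoms.T @ residual)
        corr[[i] + support] = -1.0           # never pick the point itself or a repeat
        support.append(int(np.argmax(corr)))
        # Refit the coefficients on all chosen columns by least squares.
        coef, *_ = np.linalg.lstsq(X[:, support], x, rcond=None)
        residual = x - X[:, support] @ coef
    return support, coef, residual


rng = np.random.default_rng(0)
# Two orthogonal 3-dimensional subspaces of R^6 (an illustrative assumption).
basis_a = np.eye(6)[:, :3]
basis_b = np.eye(6)[:, 3:]
X = np.hstack([
    basis_a @ rng.standard_normal((3, 20)),  # 20 points in subspace A
    basis_b @ rng.standard_normal((3, 20)),  # 20 points in subspace B
])
labels = np.array([0] * 20 + [1] * 20)

# Express point 0 (from subspace A) with three other points.
support, coef, residual = omp_self_expression(X, i=0, k=3)
```

In this idealized setting the selected points all come from the same subspace as the query point, which is what lets RBSC methods build an affinity graph from the representation coefficients. With non-orthogonal or noisy subspaces, the support is not guaranteed to be this clean.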
