Abstract

To analyze large-volume time series data, most existing clustering algorithms rely on data reduction techniques or multi-level strategies. However, these inevitably destroy the structure of the raw data, leading to information loss and even abnormal, unaccountable clustering results. To tackle these issues, we propose a distributed evidential clustering algorithm that can be applied directly to raw big data, and we parallelize it on Apache Spark, a processing engine built for sophisticated data analysis. Concretely, the possibility of each data object becoming a cluster center is first computed in parallel under the framework of evidence theory. After drawing a decision graph, the cluster centers are determined, and a credal partition of the time series data is then derived. Without simplifying the data structure, the proposed algorithm can not only detect the number of clusters but also describe the ambiguity and uncertainty in the membership of every data object. Experiments on several benchmark datasets and one real-world problem demonstrate the strong scalability and good clustering performance of the proposed algorithm.
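The abstract does not include code, but the pipeline it describes (per-object center possibility computed in parallel, a decision graph, center selection) resembles density-peaks clustering. Below is a minimal, hypothetical PySpark sketch of that pipeline, not the authors' implementation: it substitutes a Gaussian-kernel density for the evidential possibility measure (which the abstract does not specify) and ranks objects by the product of density and separation to pick candidate centers. The names and parameters (`bandwidth`, the top-5 cutoff) are illustrative assumptions.

```python
# Hypothetical sketch of the decision-graph step on Apache Spark.
# A Gaussian-kernel density stands in for the paper's evidential
# "possibility of becoming a cluster center", which is not specified here.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("evidential-centers-sketch").getOrCreate()
sc = spark.sparkContext

# Toy time-series data: each row is one aligned, equal-length series.
data = np.random.rand(200, 50)
bandwidth = 0.5  # assumed kernel bandwidth (tuning parameter)

points = sc.parallelize(list(enumerate(data)))  # (index, series)
pairs = points.cartesian(points).filter(lambda p: p[0][0] != p[1][0])

# Pairwise Euclidean distances between series, computed in parallel.
dists = pairs.map(lambda p: ((p[0][0], p[1][0]),
                             float(np.linalg.norm(p[0][1] - p[1][1]))))

# rho_i: kernel density of object i (stand-in for the center possibility).
rho = (dists.map(lambda kv: (kv[0][0], np.exp(-(kv[1] / bandwidth) ** 2)))
            .reduceByKey(lambda a, b: a + b))

rho_map = dict(rho.collect())
b_rho = sc.broadcast(rho_map)

# delta_i: distance to the nearest object of higher density; the global
# density peak keeps delta = inf and therefore always ranks as a center.
def delta(kv):
    (i, j), d = kv
    return (i, d if b_rho.value[j] > b_rho.value[i] else float("inf"))

delta_rdd = dists.map(delta).reduceByKey(min)

# Decision graph: objects with both large rho and large delta are
# candidate cluster centers (top 5 shown as an arbitrary cutoff).
decision = sorted(((i, rho_map[i], d) for i, d in delta_rdd.collect()),
                  key=lambda t: -(t[1] * t[2]))[:5]
print(decision)
```

Deriving the full credal partition, i.e., assigning each object a mass function over sets of clusters as in evidential clustering, would follow center selection; that step is omitted from this sketch.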
