Abstract

To analyze large-volume time series data, most existing clustering algorithms rely on data reduction techniques or multi-level strategies. However, these inevitably destroy the structure of the raw data, leading to information loss and even abnormal, unaccountable clustering results. To tackle these issues, we propose a distributed evidential clustering algorithm that can be applied directly to raw big data, and we parallelize it on Apache Spark, a processing engine built for sophisticated data analysis. Concretely, the possibility of each data object becoming a cluster center is first computed in parallel under the framework of evidence theory. After drawing a decision graph, the cluster centers are determined, and a credal partition of the time series data is then derived. Without simplifying the data structure, the proposed algorithm can not only detect the number of clusters but also describe the ambiguity and uncertainty in the membership of every data object. Experiments on several benchmark datasets and one real-world problem demonstrate the strong scalability and good clustering performance of the proposed algorithm.
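The abstract does not include code, but the pipeline it describes (per-object center possibility computed in parallel, a decision graph, center selection) resembles density-peaks clustering. Below is a minimal, hypothetical PySpark sketch of that pipeline, not the authors' implementation: it substitutes a Gaussian-kernel density for the evidential possibility measure (which the abstract does not specify) and ranks objects by the product of density and separation to pick candidate centers. The names and parameters (`bandwidth`, the top-5 cutoff) are illustrative assumptions.

```python
# Hypothetical sketch of the decision-graph step on Apache Spark.
# A Gaussian-kernel density stands in for the paper's evidential
# "possibility of becoming a cluster center", which is not specified here.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("evidential-centers-sketch").getOrCreate()
sc = spark.sparkContext

# Toy time-series data: each row is one aligned, equal-length series.
data = np.random.rand(200, 50)
bandwidth = 0.5  # assumed kernel bandwidth (tuning parameter)

points = sc.parallelize(list(enumerate(data)))  # (index, series)
pairs = points.cartesian(points).filter(lambda p: p[0][0] != p[1][0])

# Pairwise Euclidean distances between series, computed in parallel.
dists = pairs.map(lambda p: ((p[0][0], p[1][0]),
                             float(np.linalg.norm(p[0][1] - p[1][1]))))

# rho_i: kernel density of object i (stand-in for the center possibility).
rho = (dists.map(lambda kv: (kv[0][0], np.exp(-(kv[1] / bandwidth) ** 2)))
            .reduceByKey(lambda a, b: a + b))

rho_map = dict(rho.collect())
b_rho = sc.broadcast(rho_map)

# delta_i: distance to the nearest object of higher density; the global
# density peak keeps delta = inf and therefore always ranks as a center.
def delta(kv):
    (i, j), d = kv
    return (i, d if b_rho.value[j] > b_rho.value[i] else float("inf"))

delta_rdd = dists.map(delta).reduceByKey(min)

# Decision graph: objects with both large rho and large delta are
# candidate cluster centers (top 5 shown as an arbitrary cutoff).
decision = sorted(((i, rho_map[i], d) for i, d in delta_rdd.collect()),
                  key=lambda t: -(t[1] * t[2]))[:5]
print(decision)
```

Deriving the full credal partition, i.e., assigning each object a mass function over sets of clusters as in evidential clustering, would follow center selection; that step is omitted from this sketch.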
