Parallel implementations of incremental clustering have been proposed to improve the performance of data stream processing in smart factories, enabling real-time anomaly detection, remote diagnosis, and condition-based monitoring of Cyber-Physical Systems. Incremental clustering algorithms iteratively extract and update, over time, clusters of data points (often called micro-clusters) whose maximum number is bounded. However, controlling the cost of the computational resources used on the distributed architecture remains a challenge for the sustainable processing of massive data streams. In this paper, we present a multi-level parallelization approach for clustering massive data streams, based on a horizontal scaling platform for Big Data processing. In particular, the following levels are considered: (i) a first parallelization level is based on a multi-dimensional model with exploration facets, used to perform a first, coarse-grained partition of data streams according to a divide-and-conquer strategy; (ii) a second parallelization level is based on a buffering mechanism that splits the data stream into portions of data points, which are processed in parallel; (iii) a third parallelization level is defined over the set of micro-clusters that are generated and change over time. The approach is conceived for anomaly detection in smart manufacturing, where the concept of data relevance, defined in terms of distance from critical conditions of the monitored systems, is used to force stronger parallelization (and therefore higher resource usage) only when necessary, that is, when approaching critical conditions. The scalability and efficiency of the approach are evaluated using a real dataset in a smart factory scenario.
In particular, experiments demonstrated that when the maximum number of allowed micro-clusters decreases and the buffer size increases, parallelization based on buffering does not ensure good scalability. Additionally, as the number of features (that is, the complexity of the data stream) increases, parallelization based on buffering may present scalability issues. These results motivate tuning the different parallelization levels according to the approach proposed in this paper.
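As a rough illustration of the second parallelization level, the sketch below splits a stream into fixed-size buffers and processes them in parallel, with the number of workers driven by a relevance score. It is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the names `buffers`, `workers_for`, `centroid`, and `summarize` are hypothetical, relevance is assumed to be already normalized to [0, 1], and a per-buffer centroid stands in for the actual micro-cluster update.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List, Sequence

Point = Sequence[float]


def buffers(stream: Iterable[Point], size: int) -> Iterator[List[Point]]:
    """Split a stream into consecutive portions of at most `size` points."""
    buf: List[Point] = []
    for point in stream:
        buf.append(point)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:  # emit the last, possibly shorter, portion
        yield buf


def workers_for(relevance: float, max_workers: int) -> int:
    """Hypothetical policy: more workers as relevance approaches 1."""
    clamped = min(max(relevance, 0.0), 1.0)
    return max(1, math.ceil(clamped * max_workers))


def centroid(points: List[Point]) -> List[float]:
    """Stand-in for a micro-cluster update: mean of each feature."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def summarize(stream: Iterable[Point], size: int, relevance: float,
              max_workers: int = 4) -> List[List[float]]:
    """Process each buffer in parallel; results keep the buffer order."""
    n_workers = workers_for(relevance, max_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(centroid, buffers(stream, size)))
```

In this sketch, a low relevance value (data far from critical conditions) yields a single worker, while a value near 1 uses the full pool, which mirrors the idea of spending more resources only when approaching critical conditions.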