Abstract
In ubiquitous streaming data sources, such as sensor networks, clustering nodes by the data they produce gives insights on the phenomenon being monitored. However, centralized algorithms force communication and storage requirements to grow unbounded. This article presents L2GClust, an algorithm to compute local clusterings at each node as an approximation of the global clustering. L2GClust performs local clustering of the sources based on the moving average of each node’s data over time: the moving average is approximated using memory-less statistics; clustering is based on the furthest-point algorithm applied to the centroids computed by the node’s direct neighbors. Evaluation is performed both on synthetic and real sensor data, using a state-of-the-art sensor network simulator and measuring sensitivity to network size, number of clusters, cluster overlapping, and communication incompleteness. A high level of agreement was found between local and global clusterings, with special emphasis on separability agreement, while an overall robustness to incomplete communications emerged. Communication reduction was also theoretically shown, with communication ratios empirically evaluated for large networks. L2GClust is able to keep a good approximation of the global clustering, using less communication than a centralized alternative, supporting the recommendation to use local algorithms for distributed clustering of streaming data sources.
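As an illustration of the memoryless moving-average approximation mentioned above, the following minimal sketch uses one common fading-average formulation: a decayed sum divided by a decayed count. The class name `FadingAverage`, the fading factor `alpha`, and its default value are assumptions made for this example; they are not taken from the article, and the article's exact statistic may differ.

```python
class FadingAverage:
    """Approximate moving average kept in O(1) memory (illustrative sketch)."""

    def __init__(self, alpha: float = 0.9) -> None:
        # alpha in (0, 1]: smaller values forget old observations faster.
        # alpha = 1.0 degenerates to the plain running mean.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self._sum = 0.0    # decayed sum of observations
        self._count = 0.0  # decayed number of observations

    def update(self, x: float) -> float:
        """Fold one new observation into the sketch and return the average."""
        self._sum = x + self.alpha * self._sum
        self._count = 1.0 + self.alpha * self._count
        return self._sum / self._count

    @property
    def value(self) -> float:
        """Current fading average; undefined before the first observation."""
        if self._count == 0.0:
            raise ValueError("no observations yet")
        return self._sum / self._count


if __name__ == "__main__":
    sketch = FadingAverage(alpha=0.5)
    for reading in (1.0, 3.0, 3.0, 3.0):
        print(round(sketch.update(reading), 4))
```

Only two numbers are stored per node regardless of stream length, which is what makes this kind of statistic suitable for memory-constrained sensors.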
Highlights
Nowadays, information is generated and gathered from distributed data sources at a very high rate, stressing communications and computing infrastructure.
Clustering streaming data sources is the task of clustering different sources of data streams, based on the data series similarity.[1]
Algorithms aim to find groups of data sources that behave similarly through time, where similarity is usually measured in terms of the distance between the data series or the data distribution.
Summary
Information is generated and gathered from distributed data sources at a very high rate, stressing communications and computing infrastructure. The moving average of each node is approximated using a memoryless fading average, while clustering is based on the furthest-point algorithm applied to the centroids computed by the node’s direct neighbors. This way, each sensor acts both as a data stream source and as a processing node, keeping a sketch of its own data and a definition of the clustering structure of the entire network of data sources. The idea behind this step is to aggregate all the locally defined centers and apply a clustering procedure to them, treating the centers as the points to be clustered. This way, whenever this sensor uses or transmits its estimate Cx(i) of the global clustering structure, the estimate already reflects its most recent sketch and its neighbors’ information.
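The aggregation step described above can be sketched as follows: a node pools its own sketch with the centers received from its direct neighbors and runs the furthest-point heuristic over that pool. This is a minimal illustration under stated assumptions, not the article's implementation. The function names `furthest_point_centers` and `local_estimate`, the tuple representation of points, and the choice of the node's own sketch as the first center are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def furthest_point_centers(points: Sequence[Point], k: int) -> List[Point]:
    """Pick up to k centers by repeatedly taking the point furthest
    from the centers chosen so far (furthest-point heuristic)."""
    if not points or k <= 0:
        return []
    centers = [points[0]]  # assumption: start from the first pooled point
    # Distance from every point to its nearest chosen center.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        if nearest[idx] == 0.0:
            break  # every remaining point duplicates a chosen center
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return centers


def local_estimate(own_sketch: Point,
                   neighbor_centers: Sequence[Sequence[Point]],
                   k: int) -> List[Point]:
    """Pool the node's own sketch with its neighbors' centers and
    cluster the pooled centers as if they were ordinary points."""
    pooled = [own_sketch] + [c for centers in neighbor_centers for c in centers]
    return furthest_point_centers(pooled, k)


if __name__ == "__main__":
    estimate = local_estimate(
        own_sketch=(0.0,),
        neighbor_centers=[[(0.1,), (10.0,)], [(5.0,), (9.9,)]],
        k=3,
    )
    print(estimate)
```

Because each node only ever clusters the handful of centers held by itself and its neighbors, the cost of an update depends on the neighborhood size and on k, not on the size of the network.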