Abstract

Despite the recent growth in large-scale, distributed processing of streaming data, there are currently few options for flexible clustering of streaming data on industrial production frameworks such as MapReduce and Spark. The clustering methods that practitioners use on such systems either do not respond rapidly to new data or do not adjust the number of clusters appropriately as more data are processed. This problem is particularly acute for unstructured data such as text and other non-enumerated types, which are common in log and message streams and are not analyzed at scale for precisely this reason. To address these issues, this paper proposes a method for hierarchical clustering using adaptive hash (AdaHash) values that are recalculated during a periodic batch process and then used for streaming processing at the speed of data arrival, assuming sufficient distributed compute resources. We demonstrate that this method is as fast as other (optimal) hashing methods while allowing the hash function to adapt on each batch cycle within a lambda architecture computing framework.
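
The batch/streaming split described above can be illustrated with a minimal sketch. This is not the paper's AdaHash algorithm, whose details the abstract does not give; it is an assumed toy scheme in which a message is hashed to a prefix of its tokens, and a periodic batch job marks oversized buckets to be split one level deeper. The names `tokenize`, `ada_hash`, `rebuild_split_prefixes`, and `max_bucket_size` are hypothetical.

```python
from collections import Counter


def tokenize(message):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return tuple(message.lower().split())


def ada_hash(message, split_prefixes):
    """Streaming step: walk down the token-prefix hierarchy.

    Descends one token at a time while the current prefix is marked for
    splitting, and returns the first unmarked prefix as the cluster key.
    The cost per message is bounded by its token count.
    """
    tokens = tokenize(message)
    depth = 1
    while depth < len(tokens) and tokens[:depth] in split_prefixes:
        depth += 1
    return tokens[:depth]


def rebuild_split_prefixes(batch, max_bucket_size):
    """Batch step: recompute which prefixes must be split.

    Repeatedly hashes the batch and marks every bucket larger than
    max_bucket_size, until no new prefix is marked. Buckets made of
    identical full messages cannot be split further and stay oversized.
    """
    split_prefixes = set()
    while True:
        counts = Counter(ada_hash(m, split_prefixes) for m in batch)
        oversized = {key for key, n in counts.items() if n > max_bucket_size}
        new_prefixes = oversized - split_prefixes
        if not new_prefixes:
            return split_prefixes
        split_prefixes |= new_prefixes


# Hypothetical log lines standing in for one batch cycle.
batch = [
    "error disk full",
    "error disk failing",
    "error net down",
    "info start",
    "error disk full",
]
split_prefixes = rebuild_split_prefixes(batch, max_bucket_size=2)

# Between batch cycles, new messages are keyed with the frozen hash function.
key = ada_hash("error disk overheating now", split_prefixes)
```

In this sketch the number of clusters grows only where the data are dense: the crowded `error` and `error disk` buckets are refined, while the sparse `info` bucket stays at depth one. In a lambda architecture, `rebuild_split_prefixes` would play the role of the batch layer and `ada_hash` that of the speed layer.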
