Abstract
Despite the recent growth in large-scale, distributed processing of streaming data, there are currently few options for flexible clustering of streaming data on industrial production frameworks such as MapReduce and Spark. The clustering methods that practitioners use on such systems either do not respond rapidly to new data or do not adjust the number of clusters appropriately as more data are processed. This problem is particularly acute for unstructured data, such as text and other non-enumerated types, which are common in log and message streams and, for precisely this reason, are not analyzed at scale. To address these issues, this paper proposes a method for hierarchical clustering using adaptive hash (AdaHash) values that can be recalculated during a periodic batch process and then used for subsequent stream processing at the speed of data arrival, assuming sufficient distributed compute resources. We demonstrate that this method is as fast as other (optimal) hashing methods while allowing the hash function to adapt on each batch cycle within a lambda architecture computing framework.
More From: International Journal of Data Science and Analytics