Abstract

The widespread adoption of computing technologies has led to an abundance of large data sets. Today, tech companies such as Google, Facebook, Twitter, and Amazon handle big data sets and log terabytes, if not petabytes, of data per day. Thus, there is a need to find similarities and define groupings among the elements of these big data sets. One way to find these similarities is data clustering. Several data clustering algorithms currently exist, differing in application area and efficiency. Increases in computational power and algorithmic improvements have reduced the time needed to cluster big data sets. However, big data sets often cannot be processed whole because of hardware and computational restrictions. In this paper, the classic k-means clustering algorithm is compared with the proposed batch clustering (BC) algorithm in terms of required computation time and objective function value. The BC algorithm is designed to cluster large data sets in batches while maintaining efficiency and clustering quality. Several experiments confirm that, for big data sets, the batch clustering algorithm uses computational power and data storage more efficiently and yields better clustering than the k-means algorithm. The experiments are conducted on a data set of two million two-dimensional data points.
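
The abstract does not spell out the BC procedure, so the following is only a minimal sketch of one plausible way to cluster a data set in batches, not the authors' algorithm. It assumes that each batch is clustered with ordinary k-means (Lloyd) iterations seeded from the current centroids, and that the per-batch centroids are then merged into running centroids weighted by cluster size. All names (`lloyd`, `batch_cluster`, `objective`) and parameters (`iters`, `seed`, the batch size of 500) are hypothetical, and the objective is assumed to be the usual k-means sum of squared distances to the nearest centroid.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def lloyd(points, centroids, iters=10):
    """Plain k-means (Lloyd) iterations on one in-memory set of points.

    Returns the updated centroids and the cluster sizes from the last
    assignment step.
    """
    k, dim = len(centroids), len(centroids[0])
    counts = [0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            j = nearest(p, centroids)
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        # Empty clusters keep their previous centroid.
        centroids = [
            [s / counts[j] for s in sums[j]] if counts[j] else centroids[j]
            for j in range(k)
        ]
    return centroids, counts


def batch_cluster(batches, k, iters=10, seed=0):
    """Assumed batch scheme: cluster each batch, then merge by cluster size."""
    rng = random.Random(seed)
    centroids, totals = None, [0] * k
    for batch in batches:
        if centroids is None:
            # Seed from k random points of the first batch (an assumption).
            centroids = [list(p) for p in rng.sample(batch, k)]
        local, counts = lloyd(batch, centroids, iters)
        for j in range(k):
            n = totals[j] + counts[j]
            if n:
                # Size-weighted average of running and per-batch centroid.
                centroids[j] = [
                    (totals[j] * c + counts[j] * l) / n
                    for c, l in zip(centroids[j], local[j])
                ]
            totals[j] = n
    return centroids


def objective(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in points)


# Synthetic two-dimensional data: two well-separated blobs, shuffled so
# that every batch contains points from both.
rng = random.Random(42)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(1000)]
data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(1000)]
rng.shuffle(data)
batches = [data[i:i + 500] for i in range(0, len(data), 500)]

centroids = batch_cluster(batches, k=2)
print(centroids, objective(data, centroids))
```

Only one batch needs to be held in memory at a time in this sketch, which is the kind of storage saving the abstract alludes to; whether the paper's BC algorithm merges batch results in this way is not stated in the text.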
