Abstract
Massive data can create a real competitive advantage for companies: it is used to respond better to customers, to track consumer behavior, to anticipate trends, and so on. However, it also has drawbacks. Such data volumes not only require large storage spaces but also make analysis, processing, and retrieval operations very difficult and extremely time-consuming. One way to overcome these problems is to cluster the data into a compact format that remains an informative version of the entire dataset. Many clustering algorithms have been proposed, but they scale poorly in terms of computation time as the size of the data grows. In this paper, we make full use of consensus clustering to handle Big Data clustering. We use sampling combined with a split-and-merge strategy to fragment the data into small subsets; basic partitions are then generated locally from these subsets using RHadoop's parallel MapReduce processing model, and a consensus step is finally applied to obtain the final result. A scalability analysis is conducted to demonstrate the performance of the proposed clustering models by increasing both the number of computing nodes and the sample size while satisfying the volume and velocity dimensions of Big Data.
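As a rough illustration of the pipeline summarized above (generate basic partitions on samples, then merge them by consensus), the sketch below runs on a single machine. It is not the paper's RHadoop/MapReduce implementation; the sample fraction, the number of basic partitions, and the choice of k-means plus a co-association/average-linkage consensus are assumptions made only to keep the example self-contained.

```python
# Minimal single-machine sketch of "basic partitions on samples + consensus merge".
# All parameter values here are illustrative assumptions, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def basic_partitions(X, n_partitions=5, sample_frac=0.2, k=3, seed=0):
    """Cluster several random samples of X and extend each local model to the
    full dataset, yielding one label vector ("basic partition") per sample."""
    rng = np.random.default_rng(seed)
    partitions = []
    for i in range(n_partitions):
        sample = rng.choice(len(X), size=int(sample_frac * len(X)), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed + i).fit(X[sample])
        partitions.append(km.predict(X))  # assign every point to its nearest local centroid
    return partitions

def consensus(partitions, k=3):
    """Merge basic partitions via a co-association matrix (fraction of partitions
    in which two points share a cluster), cut by average-linkage clustering."""
    n = len(partitions[0])
    co = np.zeros((n, n))
    for labels in partitions:
        co += labels[:, None] == labels[None, :]
    co /= len(partitions)
    Z = linkage(squareform(1.0 - co, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")

# Example usage on synthetic data:
X = np.random.default_rng(1).normal(size=(1000, 8))
final_labels = consensus(basic_partitions(X, k=3), k=3)
```

In the paper's setting, the per-sample clustering would be dispatched as MapReduce tasks to RHadoop workers and the merge step would aggregate their outputs; the in-memory co-association matrix used here is only a stand-in for that distributed merge and would not scale to truly massive data.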