Abstract

This paper examines clustering, an unsupervised machine learning task in which the data carry no labels. Many algorithms have been designed to solve clustering problems, and several families of approaches have been developed to improve their efficiency and effectiveness: partitioning-based, hierarchical-based, density-based, grid-based, and model-based. As data volumes grow every second, we now face so-called big data, which has compelled researchers to adapt algorithms from these approaches so that they can process large data warehouses quickly. Our main purpose is to compare a representative algorithm from each approach against the principal big data criteria, known as the 4Vs: Volume, Variety, Velocity, and Value. The comparison aims to determine which algorithms can efficiently mine information by clustering big data. The studied algorithms are FCM, CURE, OPTICS, BANG, and EM, one from each of the approaches listed above. Assessing these algorithms against the 4Vs reveals weaknesses in several of them. All of the evaluated algorithms cluster large datasets well, but FCM and OPTICS suffer from the curse of dimensionality. FCM and EM are highly sensitive to outliers, which degrades their results. FCM, CURE, and EM require the number of clusters as input, a drawback when the optimal number is not chosen. FCM and EM produce spherical clusters, whereas CURE, OPTICS, and BANG produce arbitrarily shaped clusters, which benefits cluster quality. FCM is the fastest on big data, while EM takes the longest to train. Regarding variety of data types, CURE handles both numerical and categorical data. The analysis leads us to conclude that both CURE and BANG cluster big data efficiently, although CURE loses some accuracy in data assignment. We therefore consider BANG the most appropriate algorithm for clustering large, high-dimensional, noisy datasets. BANG is based on a grid structure but implicitly combines partitioning, hierarchical, and density approaches, which explains its accurate results. Even so, ultimate clustering accuracy has not yet been reached, only approached. The lesson drawn from BANG, namely mixing approaches, should be applied to more algorithms in order to attain the accuracy and effectiveness that lead to sound future decisions.
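
As a minimal, hedged illustration of the kind of contrast the comparison draws (not the paper's own experimental setup), the following Python sketch uses scikit-learn on synthetic two-moon data: the model-based EM algorithm (GaussianMixture) must be given the number of clusters and fits roughly elliptical components, while the density-based OPTICS algorithm finds arbitrarily shaped clusters and flags noise without a preset cluster count. The dataset, parameters, and library choice are illustrative assumptions only.

# Sketch contrasting a model-based and a density-based clustering approach
# on non-spherical synthetic data (illustrative only; not the paper's setup).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture   # EM for Gaussian mixtures
from sklearn.cluster import OPTICS            # density-based clustering

# Two interleaving half-moons with mild noise: clusters are not spherical.
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)

# Model-based (EM): the cluster count must be supplied as input (k = 2).
em_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Density-based (OPTICS): no cluster count needed; label -1 marks noise.
optics_labels = OPTICS(min_samples=20).fit_predict(X)

print("EM cluster sizes:    ", np.bincount(em_labels))
print("OPTICS cluster sizes:", np.bincount(optics_labels[optics_labels >= 0]))
print("OPTICS noise points: ", int(np.sum(optics_labels == -1)))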
