Abstract

Correlation analysis is an important tool for studying patterns in data and making predictions. Applying it has yielded many interesting findings, as patterns emerge from seemingly unrelated data. This paper explores the role of correlation analysis in data clustering. We propose an algorithm that defines an intuitive and accurate correlation metric, the general correlation coefficient (G). Based on this metric, we then define a framework for agglomerative clustering called G-based agglomerative clustering (GBAC). The framework is validated through experiments on synthetic as well as real datasets. The real-world dataset, a high-dimensional dataset on human development indicators, is taken from http://databank.worldbank.org. The objective of these evaluations is to compare the performance of the proposed framework on different types of datasets. Comparative studies are performed to validate both the proposed metric and the clustering framework. In these studies, our approach performs better than existing agglomerative clustering techniques and correlation-coefficient-based clustering methods. It is found to be effective for small, large, and high-dimensional data. Finally, the clusters generated by the framework are evaluated using existing cluster validation measures, and GBAC is found to generate cleaner, more cohesive clusters. The framework combines the predictive power of correlation coefficients with the pattern-finding ability of agglomerative hierarchical clustering. GBAC can be applied to a wide range of clustering-based applications, such as social network analysis, customer segmentation, collaborative filtering, and the construction of biological models.

