Abstract

Clustering is the task of dividing a dataset into groups, called clusters, based on similarity. Despite being extensively studied, many state-of-the-art clustering algorithms lack interpretability. While it is useful to partition objects into clusters, it can be equally useful to understand why each cluster has been created. This motivates a clustering algorithm that can explain its partition. Our proposed method is a clustering process that seeks to explain which variables of a dataset are responsible for each cluster. We analyze the shape of the data through the mathematical concept of a topological space. A space that is optimal for clustering is one that contains several disconnected islands, quantified as the number of connected components. Subsets of variables can then be selected and the connected components of the resulting topological space calculated. If this space is promising, then each resulting disconnected region is the set of data points that make up a cluster, and the subset of variables is what defines it. We choose the variables to consider by grouping them via their correlation or through more complex methods, e.g., evolutionary algorithms. Our method provides a simple explanation, since we can confidently assert that particular variables are the reason a data point is in a certain cluster. Alongside test data for comparison with existing algorithms, our methodology was applied to a community-based adolescent lipidomics dataset. Results on this dataset revealed three distinct clusters, each of which can be explained by the distinct set of lipids that defines it. Further optimization showed the existence of a cluster defined by a single lipid.
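
The idea of scoring a subset of variables by the number of connected components it produces can be illustrated with a minimal sketch. The abstract does not specify how the topological space is built, so the sketch assumes an epsilon-neighbourhood graph (two points are joined when their distance in the selected variables is at most `eps`), and it ranks subsets by exhaustive search rather than by correlation grouping or an evolutionary algorithm. All function names, the threshold `eps`, and the toy data are hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import dist


def connected_components(points, eps):
    """Label each point with the index of its connected component in the
    eps-neighbourhood graph (an assumed construction, not the paper's)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of points that lie within eps of each other.
    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) <= eps:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def components_for_subset(data, columns, eps):
    """Project the data onto the chosen variables, then find components."""
    projected = [tuple(row[c] for c in columns) for row in data]
    return connected_components(projected, eps)


def rank_subsets(data, n_vars, subset_size, eps):
    """Score every variable subset of a given size by its component count."""
    scored = []
    for cols in combinations(range(n_vars), subset_size):
        labels = components_for_subset(data, cols, eps)
        scored.append((max(labels) + 1, cols, labels))
    scored.sort(key=lambda item: -item[0])  # most components first
    return scored


# Hypothetical toy data: variable 0 splits the points into two islands,
# while variables 1 and 2 carry no cluster structure.
data = [
    (0.0, 0.1, 0.5),
    (0.1, 0.9, 0.4),
    (0.2, 0.5, 0.6),
    (5.0, 0.2, 0.5),
    (5.1, 0.8, 0.4),
    (5.2, 0.4, 0.6),
]

n_components, columns, labels = rank_subsets(data, n_vars=3, subset_size=1, eps=1.0)[0]
print(n_components, columns, labels)
```

In this toy case the top-ranked subset is variable 0 alone, and the component labels double as cluster assignments, so the explanation for each point's cluster is simply the variable subset that produced the split. Exhaustive search grows combinatorially with the number of variables, which is one plausible reason to restrict candidates with correlation grouping or a heuristic search.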
