Abstract

Clustering is a popular unsupervised learning technique that groups samples naturally to generate probably approximately correct groups. These clusters, or groups, are expected to have high intra-cluster similarity and high inter-cluster distinctiveness. Practical clustering problems may combine varying levels of intra-cluster similarity (quality) and inter-cluster distinctiveness (diversity). We propose an information-analytic approach that measures quality and diversity in terms of the cluster error. Unlike previous models, which derive information measures from the probability of samples in a cluster, the proposed framework employs a distortion-rate approach by first formulating the probability of distortion for multiple types of samples in the cluster. In this framework, cluster formation is shown to be a naturally greedy action, which leads to the average minimum distortion of a cluster. We also obtain probabilistic bounds on the cluster error and present a case study on the use of the distortion-rate approach for clustering. In the limiting case of a binary typical cluster, the bound is shown to resemble Fano's inequality. For the first time, analytic performance limits, along with notions of bias and variance, are formalized for clustering.
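
For reference, the textbook form of Fano's inequality, which the binary-cluster bound is said to resemble, can be written as below. The symbols (X, Y, the estimate \hat{X}, the error probability P_e, and the binary entropy H_b) are standard notation assumed here for illustration; the abstract does not state how they map onto the cluster quantities, so this is only a sketch of the classical result and not the paper's own bound.

```latex
% Classical Fano inequality (assumed textbook notation, not the paper's):
% X takes values in a finite alphabet \mathcal{X}, \hat{X} = g(Y) is any
% estimate of X from Y, and P_e = \Pr(\hat{X} \neq X).
\[
  H(X \mid Y) \;\le\; H_b(P_e) + P_e \log\bigl(|\mathcal{X}| - 1\bigr),
  \qquad
  H_b(p) = -p \log p - (1 - p) \log (1 - p).
\]
% Binary case: |\mathcal{X}| = 2, so \log(|\mathcal{X}| - 1) = 0 and
\[
  H(X \mid Y) \;\le\; H_b(P_e).
\]
```

In the binary case the second term vanishes, so the conditional entropy is bounded by the binary entropy of the error probability alone.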
