Abstract

Two key questions in clustering problems are how to determine the number of groups properly and how to measure the strength of group assignments. These questions are especially involved when a certain fraction of outlying data is also expected. Any answer to them should depend on the assumed probabilistic model, the allowed group scatters and what we understand by noise. With this in mind, some exploratory trimming-based tools are presented in this work together with their justifications. These tools are based on monitoring the optimal values reached when solving a robust clustering criterion and on the use of some discriminant factors.

Highlights

  • Two key questions in clustering problems are how to choose the number of groups properly and how to measure the strength of group assignments

  • For a given TCLUST clustering solution, we introduce some “confirmatory” graphical tools that will help us to evaluate the quality of the cluster assignments and the strength of the trimming decisions

  • We have proposed to take advantage of this fact when trying to choose a suitable k in Clustering problems

Summary

Introduction

Two key questions in clustering problems are how to choose the number of groups properly and how to measure the strength of group assignments. The so-called “spurious-outliers model” assumes the presence of a fraction α of the data generated by an extraneous mechanism that may be trimmed off or discarded. Within this framework, the TCLUST methodology presented in García-Escudero et al. (2008) is able to handle different types of constraints for the group scatter matrices, which allows for addressing point b) through a restriction on the group scatter matrix eigenvalues. The result of applying TCLUST to this data set appears in Figure 1(a), where k = 3, α = 0 and a large value c = 50 for the group scatter constraint constant are chosen. Proposition 3 in the Appendix shows that the discriminant factors presented here consistently estimate some population discriminant factors defined for the theoretical (unknown) distribution that generates our data set.
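The two ingredients mentioned above, an eigenvalue restriction on the group scatter matrices with constant c and the trimming of a fraction α of the data, can be illustrated with a minimal sketch. It is not the TCLUST algorithm itself. The function names are hypothetical, and the `scores` array stands in for whatever per-observation measure of fit a fitted solution would provide. Rounding the number of trimmed observations up to the next integer is also an assumption made here for illustration.

```python
import numpy as np


def eigenvalue_ratio(scatter_matrices):
    """Largest over smallest eigenvalue, taken across all group scatter matrices."""
    # eigvalsh is appropriate because scatter matrices are symmetric.
    eigvals = np.concatenate([np.linalg.eigvalsh(s) for s in scatter_matrices])
    return eigvals.max() / eigvals.min()


def satisfies_constraint(scatter_matrices, c):
    """True if the eigenvalue ratio does not exceed the constant c."""
    return bool(eigenvalue_ratio(scatter_matrices) <= c)


def trim_lowest(scores, alpha):
    """Boolean mask that drops the ceil(n * alpha) observations with the lowest scores.

    The rounding rule (ceil) is an assumption of this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    n_trim = int(np.ceil(n * alpha))
    keep = np.ones(n, dtype=bool)
    if n_trim > 0:
        keep[np.argsort(scores)[:n_trim]] = False
    return keep


# Hypothetical scatter matrices for two groups in two dimensions.
scatters = [np.diag([1.0, 4.0]), np.diag([2.0, 40.0])]
print(satisfies_constraint(scatters, c=50))  # a loose constraint
print(satisfies_constraint(scatters, c=10))  # a tighter constraint

# Hypothetical fit scores for four observations; alpha = 0.25 trims one of them.
print(trim_lowest([0.5, 0.1, 0.9, 0.7], alpha=0.25))
```

A large c, such as the value 50 quoted above, leaves the group scatters almost unrestricted. Values of c close to 1 force the groups towards similar, nearly spherical scatters.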

Simulated Examples
Clustering and mixture approaches
Objective
Old Faithful Geyser data
Findings
Swiss Bank Notes data