Abstract
Two key questions in Clustering problems are how to determine the number of groups properly and how to measure the strength of group assignments. These questions are especially involved when the presence of a certain fraction of outlying data is also expected. Any answer to these two key questions should depend on the assumed probabilistic model, the allowed group scatters and what we understand by noise. With this in mind, some exploratory trimming-based tools are presented in this work together with their justifications. These exploratory tools are based on monitoring the optimal values reached when solving a robust clustering criterion and on the use of some discriminant factors.
Highlights
Two key questions in Clustering problems are how to choose the number of groups properly and measure the strength of group-assignments
For a given TCLUST clustering solution, we introduce some “confirmatory” graphical tools that will help us to evaluate the quality of the cluster assignments and the strength of the trimming decisions
We have proposed to take advantage of this fact when trying to choose a suitable k in Clustering problems
Summary
Two key questions in Clustering problems are how to choose the number of groups properly and how to measure the strength of group assignments. The so-called “spurious-outliers model” assumes the presence of a fraction α of the data generated by an extraneous mechanism, and this fraction may be trimmed off or discarded. Within this framework, the TCLUST methodology presented in García-Escudero et al. (2008) is able to handle different types of constraints for the group scatter matrices, which allows the question of the allowed group scatters to be addressed through a restriction on the eigenvalues of the group scatter matrices. The result of applying TCLUST to this data set appears in Figure 1(a), where k = 3, α = 0 and a large value c = 50 for the group scatters constraint constant are chosen. Proposition 3 in the Appendix shows that the discriminant factors presented here consistently estimate some population discriminant factors defined for the theoretical (unknown) distribution that generates our data set.
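As an illustration of what a restriction on the group scatter matrix eigenvalues can look like, the following minimal sketch checks whether a set of scatter matrices respects a bound c. It assumes the constraint takes the form "largest eigenvalue over all groups divided by smallest eigenvalue over all groups is at most c"; that exact form, the function names and the toy matrices are assumptions made for illustration and are not taken from the text or from any TCLUST implementation.

```python
import numpy as np


def eigenvalue_ratio(scatter_matrices):
    """Ratio of the largest to the smallest eigenvalue, pooled over all groups.

    Hypothetical helper: each entry is assumed to be a symmetric
    positive-definite scatter matrix for one group.
    """
    eigvals = np.concatenate([np.linalg.eigvalsh(s) for s in scatter_matrices])
    return float(eigvals.max() / eigvals.min())


def satisfies_constraint(scatter_matrices, c):
    """True if the pooled eigenvalue ratio does not exceed the constant c."""
    return eigenvalue_ratio(scatter_matrices) <= c


# Toy example: two groups in two dimensions (made-up values).
scatters = [np.diag([1.0, 2.0]), np.diag([0.5, 40.0])]
ratio = eigenvalue_ratio(scatters)           # largest 40.0, smallest 0.5
ok_50 = satisfies_constraint(scatters, 50)   # checked against c = 50
ok_100 = satisfies_constraint(scatters, 100)
print(ratio, ok_50, ok_100)
```

Under this assumed form, a large constant such as c = 50 leaves the group scatters only loosely constrained, while c = 1 would force all groups to have equal spherical scatters.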