Visualizing Clustering Results

Ian Davidson

doi:10.1007/springerreference_63712

Abstract

Non-hierarchical clustering has a long history in numerical taxonomy [13] and machine learning [1], with many applications in fields such as data mining [2], statistical analysis [3], and information retrieval [17]. Clustering involves finding a specific number of subgroups (k) within a set of s observations (data points/objects), each described by d attributes. A clustering algorithm generates cluster descriptions and assigns each observation either to one cluster (exclusive assignment) or in part to many clusters (partial assignment). Throughout this paper, we shall refer to the output of a clustering algorithm as the clustering results, solution, or model. The information in a clustering solution is extensive: a mixture model or K-Means model produces k × s conditional probabilities or distances. Visualizing the clustering results can help the user assimilate this information quickly and provide insights that support and complement textual descriptions or statistical summaries. For example, we wish to know quickly how well defined the clusters are, how different they are from each other, what their sizes are, and whether the observations belong strongly to their cluster or only marginally. Visualizing a clustering solution has many potential uses. During the highly iterative model-building process, the analyst can quickly obtain insights from the visualization that suggest the adequacy of the solution and what further experiments to conduct. Alternatively, the business user can examine and query the final clustering solution using the visualization. The interesting parts of a clustering solution depend on the application. Database segmentation applications such as target marketing focus on the clusters and investigate which clusters are similar, which are autonomous, and which have, for example, a high propensity to cross-sell. Anomaly detection applications attempt to identify those observations that do not “belong”, are interesting, and require further investigation. Here the focus is the observations, and we wish to know whether they belong strongly or only marginally to their most likely cluster. Typical uses of anomaly detection are detecting money laundering, identifying network intrusion, and data cleaning [5]. In this paper, we describe a general particle framework to display the information in a clustering solution. Changes to the parameters of the framework can emphasize
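
To make the k × s quantities mentioned above concrete, the following minimal sketch computes the distance from each of s observations to each of k cluster centroids, derives an exclusive assignment, and scores how strongly each observation belongs to its cluster. The function name `membership_summary`, the toy data, and the "margin" measure (gap between the nearest and second-nearest centroid) are illustrative assumptions, not part of the framework described in the paper.

```python
import math

def membership_summary(observations, centroids):
    """Return per-observation distances, exclusive assignments, and margins.

    Assumes at least two centroids (k >= 2) so that a second-nearest
    centroid exists. The margin is a hypothetical strength-of-membership
    score chosen for illustration only.
    """
    # One row per observation, one column per centroid: s rows of k distances.
    distances = [[math.dist(x, c) for c in centroids] for x in observations]

    # Exclusive assignment: index of the nearest centroid.
    assignment = [row.index(min(row)) for row in distances]

    # Margin: second-nearest distance minus nearest distance.
    # A small margin suggests the observation belongs only marginally.
    margin = []
    for row in distances:
        nearest, second = sorted(row)[:2]
        margin.append(second - nearest)

    return distances, assignment, margin

# Toy data: s = 4 observations with d = 2 attributes, k = 2 centroids.
observations = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (2.0, 2.1)]
centroids = [(0.0, 0.0), (4.0, 4.0)]

distances, assignment, margin = membership_summary(observations, centroids)
weakest = margin.index(min(margin))  # observation with the most marginal membership
```

In this toy data the last observation lies almost midway between the two centroids, so it receives the smallest margin. That is the kind of marginal membership an anomaly detection application would want to surface.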
