The Methods and Tools for Clustering Analysis

Zhaoyuan Fang

doi:10.1016/b978-0-12-801238-3.11463-1

Abstract

Cluster analysis is widely employed in many domains of scientific research. To meet its growing use in biomedical studies, this tutorial provides a conceptual overview of the most important and representative clustering algorithms, pointing out their respective application prerequisites, technical issues, strengths and restrictions, as well as software implementations. Seven clustering methods are covered, one per section: hierarchical clustering, k-means and related methods, mixture models, non-negative matrix factorization, spectral clustering, density-based clustering, and self-organizing maps. Hierarchical clustering and k-means are classic methods that remain in wide use and often perform well. Several extensions of k-means that remedy some of its practical limitations are also discussed. Mixture models have a strong statistical foundation, and their parameters can be estimated numerically. Non-negative matrix factorization represents data as additive linear combinations of meaningful parts, which gives it a natural clustering interpretation. Spectral clustering is closely related to graph cut problems, for which it yields an approximate solution. Density-based clustering identifies clusters by their distinct density distributions. Self-organizing maps use a neural network to approximate the underlying data manifold and offer unique visualization capabilities. Each algorithm has one or more parameters that need to be tuned for better clustering; these are discussed in the corresponding section. A comprehensive list of software packages in popular programming languages such as R, Python and Matlab is also provided.

Full Text