Background: Clustering on projected data is common in biomedical research. Principal component analysis (PCA) is widely used for projection and focuses on data dispersion (variance), whereas clustering identifies data concentrations (neighborhoods). These aims conflict. This study re-evaluates combinations of PCA and other projection methods with common clustering algorithms.

Methods: Six projection methods (PCA, ICA, isomap, MDS, t-SNE, UMAP) were combined with five clustering algorithms (k-means, k-medoids, single link, Ward's method, average link). Projections and clusterings were evaluated using a numerical criterion of clustering performance and a visual criterion based on plotting the projected data on a Voronoi tessellation plane with class-wise coloring. Nine artificial and five real biomedical datasets were analyzed.

Results: No combination consistently captured the prior classifications in both projections and clusters. Visual inspection proved essential. PCA was often, but not always, equaled or outperformed by neighborhood-based methods (UMAP, t-SNE) and manifold learning techniques (isomap).

Conclusions: The results argue against PCA as a standard projection method prior to clustering. Method selection should therefore be data-specific, tailoring projection and clustering to the dataset at hand in biomedical analysis. To aid this process, we propose a novel visualization technique that combines Voronoi tessellation with color coding.
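The tension between variance and neighborhood described in the background can be illustrated with a minimal sketch. It is not the study's pipeline: the synthetic two-class dataset, the helper names (`pca_project`, `two_means_1d`, `two_class_accuracy`), and the permutation-maximized accuracy used as the numerical criterion are all assumptions made for illustration. The data are built so that the direction of largest variance carries no class information, which means a one-dimensional PCA projection may discard exactly the structure that clustering needs.

```python
import numpy as np


def pca_project(X, n_components=1):
    """Project centered data onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def two_means_1d(z, n_iter=50):
    """Minimal k-means with k=2 on 1-D data, initialized at the min and max."""
    z = np.asarray(z, dtype=float).ravel()
    centers = np.array([z.min(), z.max()])
    labels = np.zeros(z.shape[0], dtype=int)
    for _ in range(n_iter):
        # Assign each point to the nearer of the two centers.
        labels = (np.abs(z - centers[1]) < np.abs(z - centers[0])).astype(int)
        new_centers = np.array([z[labels == k].mean() for k in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def two_class_accuracy(labels, classes):
    """Agreement with the prior classes, maximized over both label namings."""
    acc = float(np.mean(labels == classes))
    return max(acc, 1.0 - acc)


# Hypothetical data: classes differ along y (small variance) while x
# (large variance) is pure noise shared by both classes.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(0.0, 10.0, size=2 * n)
y = np.concatenate([rng.normal(-1.0, 0.2, n), rng.normal(1.0, 0.2, n)])
X = np.column_stack([x, y])
classes = np.repeat([0, 1], n)

# Cluster after a 1-D PCA projection versus on the class-bearing axis.
acc_pca = two_class_accuracy(two_means_1d(pca_project(X, 1)), classes)
acc_y = two_class_accuracy(two_means_1d(X[:, 1]), classes)

print(f"accuracy after 1-D PCA projection: {acc_pca:.2f}")
print(f"accuracy on the class-bearing axis: {acc_y:.2f}")
```

In this constructed case the first principal component aligns with the noisy x axis, so clustering the projection should recover the classes no better than chance, while the low-variance y axis separates them cleanly. Real data are rarely this extreme, which is consistent with the finding that PCA is often, but not always, outperformed.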