Clustering of samples and variables with mixed-type data.

Manuela Hummel, Dominic Edelmann, Annette Kopp-Schneider, Zhaohong Deng

doi:10.1371/journal.pone.0188274


Open Access

https://doi.org/10.1371/journal.pone.0188274


Journal: PloS one	Publication Date: Nov 28, 2017
Citations: 48	License type: CC BY 4.0

Affiliation: German Cancer Research Center, Heidelberg University

Abstract

Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need to integrate other features possibly measured on different scales, e.g. clinical or cytogenetic factors, is becoming increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, with further information, such as clinical factors, added on top. However, a more integrative approach is desirable, in which all available data are analyzed jointly and the different data sources are also combined more naturally in the visualization. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables receives less attention in the literature than clustering of samples. We extend the variable clustering methodology with two new approaches, one based on the combination of different association measures and the other on distance correlation. In simulation studies we evaluate and compare different clustering strategies. Methods designed specifically for mixed-type data perform comparably to, and in many cases better than, standard approaches applied to the corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples give an impression of the various potential applications of the integrative heatmap and other graphical displays based on dissimilarity matrices.
We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.

Highlights

In real data situations, various factors of interest are measured on different scales, e.g. quantitative gene expression values and categorical clinical features such as gender or disease stage.
The performance of the methods is compared by balanced error rates ($\mathrm{BER} = 0.5 \cdot \left( \frac{a_{12}}{a_{11} + a_{12}} + \frac{a_{21}}{a_{21} + a_{22}} \right)$, where $a_{ij}$ are the entries of the classification confusion matrix), which is more suitable than the misclassification rate in case of unequal class sizes.
We described how datasets that include parameters measured on different scales can be used in cluster analysis of both individual samples and measured variables, without bringing the data onto a common scale, which would usually mean a loss of information.
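
The balanced error rate named in the highlights can be illustrated with a short Python sketch. The function names and the toy confusion matrix are hypothetical, and the sketch assumes that rows of the matrix are true classes and columns are assigned classes; it is not taken from the CluMix package.

```python
def balanced_error_rate(confusion):
    """BER = 0.5 * (a12 / (a11 + a12) + a21 / (a21 + a22)).

    Assumption: confusion[i][j] counts samples of true class i+1
    that were assigned to class j+1.
    """
    (a11, a12), (a21, a22) = confusion
    n_class1, n_class2 = a11 + a12, a21 + a22
    if n_class1 == 0 or n_class2 == 0:
        raise ValueError("each true class needs at least one sample")
    # Average of the two per-class error rates
    return 0.5 * (a12 / n_class1 + a21 / n_class2)


def misclassification_rate(confusion):
    """Share of all samples that were assigned to the wrong class."""
    (a11, a12), (a21, a22) = confusion
    return (a12 + a21) / (a11 + a12 + a21 + a22)


# Hypothetical unbalanced case: 90 samples in class 1, 10 in class 2,
# and every sample is assigned to class 1.
toy = [[90, 0], [10, 0]]
ber = balanced_error_rate(toy)
mcr = misclassification_rate(toy)
print(ber, mcr)
```

In this toy case the misclassification rate looks small because the large class dominates, whereas the BER weights both classes equally and exposes that the small class is entirely misassigned.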

Summary

Methods

There is plenty of literature on clustering samples, even for mixed numerical and categorical data; see Table 2 for an overview of the considered methods. Approaches such as latent class clustering [14], k-prototypes clustering [15], fuzzy clustering [16] and others [19] aim at partitioning the data into a fixed number of clusters, which is computationally more efficient than hierarchical clustering, especially for large datasets, because hierarchical clustering requires the complete dissimilarity matrix.

Table 2 (excerpt). Clustering / distance method: latent class clustering [14]; k-prototypes [15]; fuzzy clustering [16]; Mahalanobis-type distance [17]; Value Difference Metric [18]; Gower’s similarity coefficient [8]. Column "hierarchical": ✓ ✓ ✓
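
To make the "complete dissimilarity matrix" concrete, the following Python sketch computes a Gower-type dissimilarity (one minus Gower's similarity coefficient [8]) between all pairs of samples. It is a simplified illustration under stated assumptions: all variables get equal weight, categorical variables (including ordered ones) are compared by simple matching, missing values are not handled, and the function name and toy data are hypothetical rather than part of CluMix.

```python
def gower_dissimilarity(samples, is_numeric):
    """Return the full n x n Gower-type dissimilarity matrix.

    samples    : list of tuples, one tuple per sample
    is_numeric : list of bools, True where the variable is quantitative
    """
    n, p = len(samples), len(is_numeric)

    # Range of each quantitative variable, used to scale differences to [0, 1]
    ranges = []
    for k in range(p):
        if is_numeric[k]:
            column = [row[k] for row in samples]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(p):
                xi, xj = samples[i][k], samples[j][k]
                if is_numeric[k]:
                    # Range-scaled absolute difference; constant variables add 0
                    if ranges[k] > 0:
                        total += abs(xi - xj) / ranges[k]
                else:
                    # Simple matching: 0 if categories agree, 1 otherwise
                    total += 0.0 if xi == xj else 1.0
            dist[i][j] = dist[j][i] = total / p
    return dist


# Hypothetical toy data: (expression value, gender, disease stage)
toy_samples = [(1.0, "female", "I"), (3.0, "female", "II"), (5.0, "male", "II")]
d = gower_dissimilarity(toy_samples, [True, False, False])
for row in d:
    print(row)
```

The matrix has n(n-1)/2 distinct entries, which is why approaches that require it scale less well to large datasets than partitioning methods. The same matrix could be passed to any hierarchical clustering routine that accepts precomputed dissimilarities.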
