Clustering gene expression data with a penalized graph-based metric

Ariel E Bayá, Pablo M Granitto

doi:10.1186/1471-2105-12-2

Abstract

Background: The search for cluster structure in microarray datasets is a basic problem for the so-called "-omic sciences". A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that do not form compact clouds of points but instead trace arbitrary shapes or paths embedded in a high-dimensional space, as could be the case for some gene expression datasets.

Results: In this work we introduce the Penalized k-Nearest-Neighbor-Graph (PKNNG) based metric, a new tool for evaluating distances in such cases. The new metric can be used in combination with most clustering algorithms. The PKNNG metric is based on a two-step procedure: first it constructs the k-Nearest-Neighbor-Graph of the dataset of interest using a low k-value, and then it adds edges with a highly penalized weight to connect the subgraphs produced by the first step. We discuss several possible schemes for connecting the different subgraphs, as well as penalization functions. We show clustering results on several public gene expression datasets and simulated artificial problems to evaluate the behavior of the new metric.

Conclusions: In all cases the PKNNG metric shows promising clustering results. The use of the PKNNG metric can improve the performance of commonly used pairwise-distance based clustering methods to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method; they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their high-dimensional data.
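The two-step procedure described in the Results can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `pknng_distances`, the choice of joining subgraphs through their shortest gaps (a spanning-tree-style scheme), and the exponential penalty `d * exp(d / mu)`, with `mu` the mean edge length of the k-NN graph, are all assumptions made here for illustration. The abstract states only that several connection schemes and penalization functions are discussed.

```python
import heapq
import math
from itertools import combinations


def pknng_distances(points, k=2):
    """Illustrative penalized k-NN-graph metric; returns an n x n matrix."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Step 1: symmetric k-nearest-neighbor graph built with a low k.
    graph = [{} for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: dist[i][j])[:k]
        for j in nearest:
            graph[i][j] = graph[j][i] = dist[i][j]

    # Label the connected subgraphs produced by step 1 (depth-first search).
    label = [-1] * n
    n_comp = 0
    for start in range(n):
        if label[start] != -1:
            continue
        label[start] = n_comp
        stack = [start]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if label[v] == -1:
                    label[v] = n_comp
                    stack.append(v)
        n_comp += 1

    # Penalty scale: mean k-NN edge length (assumption; needs n >= 2, k >= 1
    # and at least two distinct points so that mu is positive).
    weights = [w for i in range(n) for j, w in graph[i].items() if i < j]
    mu = sum(weights) / len(weights)

    # Step 2: join subgraphs, shortest gaps first, using penalized edges.
    parent = list(range(n_comp))

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    gaps = sorted((dist[i][j], i, j)
                  for i, j in combinations(range(n), 2)
                  if label[i] != label[j])
    for d, i, j in gaps:
        a, b = find(label[i]), find(label[j])
        if a != b:
            parent[a] = b
            # Assumed exponential penalty; can overflow for very large d / mu.
            graph[i][j] = graph[j][i] = d * math.exp(d / mu)

    # The metric is the shortest-path length over the connected graph
    # (Dijkstra's algorithm from every source node).
    result = []
    for source in range(n):
        best = [math.inf] * n
        best[source] = 0.0
        heap = [(0.0, source)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > best[u]:
                continue
            for v, w in graph[u].items():
                if du + w < best[v]:
                    best[v] = du + w
                    heapq.heappush(heap, (best[v], v))
        result.append(best)
    return result


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (100.0, 0.0), (101.0, 0.0), (102.0, 0.0)]
D = pknng_distances(points, k=2)
print(D[0][2], D[0][3] > math.dist(points[0], points[3]))
```

The resulting matrix can be handed to any clustering method that accepts precomputed pairwise distances, which is the usage the Conclusions suggest. Distances within a subgraph stay close to path lengths along the data, while distances that cross a penalized edge are inflated well beyond the straight-line distance.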

Highlights

The search for cluster structure in microarray datasets is a basic problem for the so-called “-omic sciences”.
Evaluation on artificial datasets: In a first series of experiments we used artificial datasets to evaluate the behavior of the new metric in controlled situations, in which we change the difficulty of the clustering problem by setting, for example, the dimensionality of the input space or the distance between the clusters.
This dataset simulates a problem in which all genes are still correlated, but the correlation matrix is different for each experimental condition, which leads to a better separation when using correlation as the base metric.

Summary

Introduction

The search for cluster structure in microarray datasets is a basic problem for the so-called “-omic sciences”. A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that do not form compact clouds of points but instead trace arbitrary shapes or paths embedded in a high-dimensional space, as could be the case for some gene expression datasets. Several problems can be addressed with microarray technology. It can be used for the identification of differentially expressed genes [1], which could highlight possible gene targets for more detailed molecular studies or drug treatments. Another application is to assign samples to known classes (class prediction) [2], using genetic profiles to improve, for example, the diagnosis of cancer patients. Dealing with high-dimensional spaces is a known challenge for clustering procedures, which usually fail to handle such manifold-structured data.

Journal: BMC Bioinformatics
Publication Date: Jan 4, 2011
Citations: 96
License type: CC BY 2.0

Similar Papers

An Affinity Propagation Clustering Method Using Hybrid Kernel Function With LLE
Lin Sun ... Ruonan Liu
IEEE Access | VOL. 6
01 Jan 2018

Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees.
Ying Xu ... Dong Xu
Bioinformatics | VOL. 18
01 Apr 2002

Machine-learned cluster identification in high-dimensional data
Alfred Ultsch ... Jörn Lötsch
Journal of Biomedical Informatics | VOL. 66
28 Dec 2016

Performance Analysis of Hard and Soft Clustering Approaches For Gene Expression Data
P K Nizar Banu ... S Andrews
International Journal of Rough Sets and Data Analysis | VOL. 2
01 Jan 2015
