Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data

Argiris Sakellariou, George Spyrou, Despina Sanoudou

doi:10.1186/1471-2105-13-270

Abstract

Background: A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping the N top ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars.

Results: We propose a hybrid feature selection (FS) method, mAP-KL, which combines multiple hypothesis testing and the affinity propagation (AP) clustering algorithm with the Krzanowski & Lai cluster quality index to select a small yet informative subset of genes. We applied mAP-KL to real microarray data, as well as to simulated data, and compared its performance against 13 other feature selection approaches. Across a variety of diseases and sample sizes, mAP-KL presents competitive classification results, particularly in neuromuscular diseases, where its overall AUC score was 0.91. Furthermore, mAP-KL generates concise yet biologically relevant and informative N-gene expression signatures, which can serve as a valuable tool for diagnostic and prognostic purposes, as well as a source of potential disease biomarkers in a broad range of diseases.

Conclusions: mAP-KL is a data-driven and classifier-independent hybrid feature selection method, which applies to any disease classification problem based on microarray data, regardless of the number of available samples. Combining multiple hypothesis testing and AP leads to subsets of genes that classify unknown samples from both small and large patient cohorts with high accuracy.

Highlights

A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size
Rationale for selecting the proposed approach: Jaeger et al. [16] claimed that ranking algorithms produce lists of genes in which the top ranked genes are highly correlated with each other, mainly because they belong to the same pathway
Regarding the number of genes, we employ a clustering index to determine the ‘actual’ number of representative genes. This differs from the mRMR method, which iterates over its ranked gene list before settling on a subset, and from Jaeger and Hanczar, where the resultant subset is driven by the initial number of potential clusters, which is set arbitrarily
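
As an illustration of how a cluster quality index can pick the number of representative genes, the sketch below computes the Krzanowski & Lai index in its usual textbook form, KL(k) = |DIFF(k) / DIFF(k+1)| with DIFF(k) = (k-1)^(2/p) W_(k-1) - k^(2/p) W_k, where W_k is the within-cluster sum of squares for k clusters and p is the number of dimensions. The function name `kl_index`, the input layout, and the toy W_k values are assumptions made for this example; the page does not describe how mAP-KL couples the index with affinity propagation, so this is not the authors' implementation.

```python
def kl_index(wss, n_features):
    """Krzanowski-Lai scores for each k that has both neighbours available.

    wss: dict mapping consecutive cluster counts k to the within-cluster
         sum of squares W_k (hypothetical input layout for this sketch).
    n_features: dimensionality p of the clustered points.
    """
    scale = 2.0 / n_features

    def diff(k):
        # DIFF(k) = (k-1)^(2/p) * W_(k-1) - k^(2/p) * W_k
        return (k - 1) ** scale * wss[k - 1] - k ** scale * wss[k]

    ks = sorted(wss)
    scores = {}
    # The smallest and largest k are skipped: they lack W_(k-1) or W_(k+1).
    for k in ks[1:-1]:
        denom = diff(k + 1)
        scores[k] = abs(diff(k) / denom) if denom != 0 else float("inf")
    return scores


# Toy values (made up): the drop in W_k flattens after k = 3.
toy_wss = {1: 100.0, 2: 40.0, 3: 10.0, 4: 8.0, 5: 7.0}
scores = kl_index(toy_wss, n_features=2)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The k with the largest score is taken as the suggested number of clusters, so the subset size follows from the data rather than from an arbitrary starting value.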

Summary

Introduction

A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size. Informative genes are selected according to a two-sample statistical test combined with multiple testing procedures to guard against Type 1 errors [1]. A wide variety of feature selection (FS) algorithms has been proposed [3,4,5] and, depending on how they combine the feature selection search with the construction of the classification model, they can be classified into three categories: filter, wrapper, and embedded [2]. Embedded algorithms, like Random Forest [14], select the best subset of genes incorporating the classifiers’ bias [2].
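
To make the multiple testing step concrete, here is a minimal sketch that adjusts per-gene p-values and keeps the genes that pass a threshold. It assumes the p-values have already been produced by a two-sample test, and it uses the Benjamini-Hochberg step-up adjustment purely as an example: that procedure controls the false discovery rate, and this page does not say which multiple testing procedure mAP-KL actually applies. The function names, the `alpha` default, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(p_values, alpha=0.05):
    """Indices of genes whose adjusted p-value is at most alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q <= alpha]


# Made-up p-values for five genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
selected = select_genes(p_values)
print(adjusted, selected)
```

Genes 2 and 3 have raw p-values below 0.05 but are dropped after adjustment, which is the kind of filtering a multiple testing procedure is meant to provide before any clustering step.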

Methods

Results

Conclusion

Journal: BMC Bioinformatics	Publication Date: Oct 17, 2012
Citations: 83	License type: cc-by

Similar Papers

Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data.
Lokeswari Venkataramana ... Shomona Gracia Jacob
Genes & genomics | VOL. 41 | 19 Aug 2019

HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data
Y Wang ... F S Makedon
Bioinformatics | VOL. 21 | 07 Dec 2004

Identification of disease-critical genes causing preeclampsia: Meta-heuristic approaches
Jahnabi Dutta ... Surama Biswas
01 Dec 2015

Informative Feature Clustering and Selection for Gene Expression Data
Yuqi Yang ... Zhihang Luo
IEEE Access | VOL. 7 | 01 Jan 2019
