Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms.

Yu Guo, Raji Balasubramanian, Robert N McBurney, Armin Graber

doi:10.1186/1471-2105-11-447

Abstract

Background: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant to the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques.

Results: The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower than that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper.

Conclusion: No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
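The kind of simulation-based power estimate described in the abstract can be illustrated with a minimal Python sketch. This is not the MVpower library or the authors' procedure. It assumes Gaussian features with unit variance, balanced classes, and a plain nearest-centroid classifier used as a simplified stand-in for PAM (no centroid shrinkage). The definition of "power" used here, the fraction of simulated studies whose held-out accuracy reaches a target, and all function and parameter names are assumptions made for illustration.

```python
import random
import statistics


def simulate_dataset(n_per_class, n_features, n_informative, effect_size, rng):
    """Draw unit-variance Gaussian features for two balanced classes.

    Only the first n_informative features differ between classes:
    their mean is shifted by effect_size in class 1 (assumed setup).
    """
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(effect_size if (label == 1 and j < n_informative) else 0.0, 1.0)
                for j in range(n_features)
            ]
            X.append(row)
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Compute the per-class mean vector (no shrinkage, unlike PAM)."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))


def accuracy(centroids, X, y):
    """Fraction of rows in X that are classified correctly."""
    hits = sum(predict(centroids, x) == lab for x, lab in zip(X, y))
    return hits / len(y)


def estimate_power(n_per_class, effect_size, n_features=200, n_informative=10,
                   target_accuracy=0.8, n_sims=30, n_test_per_class=50, seed=0):
    """Fraction of simulated studies whose held-out accuracy reaches the target."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_sims):
        X_train, y_train = simulate_dataset(
            n_per_class, n_features, n_informative, effect_size, rng)
        X_test, y_test = simulate_dataset(
            n_test_per_class, n_features, n_informative, effect_size, rng)
        model = fit_centroids(X_train, y_train)
        if accuracy(model, X_test, y_test) >= target_accuracy:
            successes += 1
    return successes / n_sims


if __name__ == "__main__":
    # 200 features with only 10-30 training subjects per class: features exceed subjects.
    for n in (10, 20, 30):
        print(n, estimate_power(n_per_class=n, effect_size=0.5))
```

Repeating the call over a grid of sample sizes and effect sizes gives a rough power curve for one classifier. Swapping in other classifiers, skewed feature distributions, or unequal class sizes would follow the same pattern.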

Highlights

Data generated using ‘omics’ technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study.
In the second sub-section (‘Simulation Results’), we present results from simulation studies comparing the performance of the classifiers K-nearest neighbor (KNN), Prediction Analysis for Microarrays (PAM), Random Forests (RF) and Support Vector Machines (SVM) in high-dimensionality data settings.
We conducted simulation studies comparing the performance of the classifiers KNN, PAM, RF and SVM in settings in which the number of features exceeded the number of subjects in the study.

Summary

Introduction

Data generated using ‘omics’ technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. Recent examples of applications include a study to identify differential gene expression patterns that distinguish sub-classes of pediatric and adult leukemia [1] and a proteomic study to detect serum-based biomarkers for the diagnosis of head and neck cancers [2]. Such experiments typically aim to discover a subset of features and an associated algorithm, based on the selected subset of features, that can be used to predict a subject’s class. Several classifiers are commonly used in the analysis of ‘omics’ data, including Random Forests [3], Prediction Analysis for Microarrays [4], K-nearest neighbor classification [5] and Support Vector Machines [6]. Because each of these classifiers involves complex algorithms based on a variety of assumptions, their relative performance is naturally expected to vary depending on the application and the nature of the data.

Journal: BMC Bioinformatics	Publication Date: Sep 3, 2010
Citations: 98	License type: cc-by
