Abstract

Background: Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his/her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited due to cost constraints, making it necessary to determine the sample size needed to build an accurate classifier based on SNPs. The performance of such classifiers can be assessed using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) for two classes and the Volume Under the ROC hyper-Surface (VUS) for three or more classes. Sample size determination based on AUC or VUS would not only guarantee an overall correct classification rate, but also make studies more cost-effective.

Results: For coded SNP data from D (≥ 2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the probability of correct classification for each classifier. These approximations are then used to evaluate the associated AUCs or VUSs, whose accuracies are validated using Monte Carlo simulations. We give a sample size determination method, which ensures that the difference between the two approximate AUCs (or VUSs) is below a pre-specified threshold. The performance of our sample size determination method is then illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs and the required total sample sizes are determined for a continuum of threshold values. In all, four different sample size determination studies are conducted with the HapMap data, covering cases that range from well-separated populations to poorly separated ones.

Conclusion: For multi-class problems, we have developed a sample size determination methodology and illustrated its usefulness in obtaining a required sample size from the estimated learning curve. For classification scenarios, this methodology will help scientists determine whether a sample at hand is adequate or more samples are required to achieve a pre-specified accuracy. A PDF manual for the R package “SampleSizeSNP” is given in Additional file 1, and a ZIP file of the R package “SampleSizeSNP” is given in Additional file 2.

Highlights

  • Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his/her risk of developing a disease

  • Monte Carlo simulations: Before we illustrate the performance of our sample size determination method based on Area Under the ROC curve (AUC) or Volume Under the ROC hyper-Surface (VUS), we present results from an extensive Monte Carlo simulation study conducted to verify the accuracy of the approximations for AUC(n) and VUS(n), respectively, and study their behavior as a function of n and other parameters

  • We have considered the two commonly used scalar performance measures, the Area Under the Receiver Operating Characteristic (ROC) curve (AUC) and the Volume Under the ROC hyper-Surface (VUS), which allow classifiers to be compared independent of discrimination values
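
To make the two performance measures concrete, the sketch below computes empirical versions of AUC and VUS from scalar classifier scores. It is an illustration only, not the normal-approximation approach of the paper: it assumes the classes are ordered along a single score, and it uses one common empirical definition, the fraction of tuples (one score per class) that appear in the correct order. The function names `empirical_auc` and `empirical_vus` are hypothetical.

```python
from itertools import product


def empirical_auc(neg, pos):
    """Fraction of (neg, pos) score pairs ranked correctly; ties count one half."""
    total = 0.0
    for x, y in product(neg, pos):
        if x < y:
            total += 1.0
        elif x == y:
            total += 0.5
    return total / (len(neg) * len(pos))


def empirical_vus(*classes):
    """Fraction of tuples (one score per class) in strictly increasing order.

    Assumes the classes are passed in their expected order along the score.
    Ties are counted as errors in this simplified version.
    """
    n_tuples = 1
    for scores in classes:
        n_tuples *= len(scores)
    hits = sum(
        1
        for combo in product(*classes)
        if all(a < b for a, b in zip(combo, combo[1:]))
    )
    return hits / n_tuples


# Toy scores, invented for illustration only.
print(empirical_auc([1, 3], [2, 4]))
print(empirical_vus([1, 4], [2, 5], [3, 6]))
```

Under this definition a classifier with no discriminating power scores about 1/2 for two classes and about 1/6 for three classes, which is why VUS values should be read against the number of classes.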


Summary

Introduction

Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his/her risk of developing a disease. Liu et al. [11] developed an optimal Bayes classifier and a linear classifier for coded SNP data from two classes, and obtained a normal approximation to the probability of correct classification (PCC) for each classifier. They proposed a sample size determination methodology to determine an adequate sample size, which ensures that the difference between the two approximate PCCs is below a pre-specified threshold value. Using Monte Carlo simulations, Liu et al. [11] assessed the validity of their approximations.
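
The threshold rule described above can be sketched as a simple search over candidate sample sizes. This is a minimal illustration under stated assumptions, not the authors' implementation or the API of the “SampleSizeSNP” package: `smallest_adequate_n` and `toy_curve` are hypothetical names, the target value stands in for the accuracy of the optimal classifier, and the learning curve is a made-up function used in place of the normal approximation.

```python
def smallest_adequate_n(target, learning_curve, threshold, n_min=2, n_max=10_000):
    """Return the smallest n whose accuracy gap to the target is below threshold.

    target         -- accuracy (PCC, AUC or VUS) of the optimal classifier
    learning_curve -- callable giving the trained classifier's accuracy at size n
    threshold      -- pre-specified tolerance on the gap
    Returns None if no n in [n_min, n_max] qualifies.
    """
    for n in range(n_min, n_max + 1):
        if target - learning_curve(n) < threshold:
            return n
    return None


def toy_curve(n):
    # Made-up learning curve that rises toward 0.90 as n grows;
    # it is NOT taken from the paper.
    return 0.90 - 0.40 / n


print(smallest_adequate_n(0.90, toy_curve, threshold=0.015))
```

A linear scan is used rather than a bisection so that the sketch does not rely on the learning curve being monotone in n; a tighter threshold simply pushes the returned sample size up.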
