Abstract

Background: Data artifacts due to variations in experimental handling are ubiquitous in microarray studies, and they can lead to biased and irreproducible findings. A popular approach to correcting for such artifacts is post hoc data adjustment such as data normalization. Statistical methods for data normalization have been developed and evaluated primarily for the discovery of individual molecular biomarkers. Their performance has rarely been studied for the development of multi-marker molecular classifiers, an increasingly important application of microarrays in the era of personalized medicine.

Methods: In this study, we set out to evaluate the performance of three commonly used methods for data normalization in the context of molecular classification, using extensive simulations based on re-sampling from a unique pair of microRNA microarray datasets for the same set of samples. The data and code for our simulations are freely available as R packages at GitHub.

Results: In the presence of confounding handling effects, all three normalization methods tended to improve the accuracy of the classifier when evaluated on independent test data. The level of improvement and the relative performance of the normalization methods depended on the relative level of molecular signal, the distributional pattern of the handling effects (e.g., location shift vs scale change), and the statistical method used for building the classifier. In addition, cross-validation was associated with biased estimation of classification accuracy in the over-optimistic direction for all three normalization methods.

Conclusion: Normalization may improve the accuracy of molecular classification for data with confounding handling effects; however, it cannot circumvent the over-optimistic findings associated with cross-validation for assessing classification accuracy.
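To make the normalization step concrete, below is a minimal, self-contained R sketch of quantile normalization applied to a probe-by-sample expression matrix, with a simulated location-shift handling effect confounded with class in the training arrays. The data, object names (quantile_normalize, x_train, x_test), and simulation settings are illustrative placeholders, not the authors' released R packages.

```r
## Minimal illustrative sketch (not the authors' released code): quantile
## normalization of a probe-by-sample expression matrix. Training and test
## data are simulated placeholders; in the study they were re-sampled from
## paired microRNA microarray datasets.

quantile_normalize <- function(x) {
  ## x: probes in rows, samples (arrays) in columns
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  ref    <- rowMeans(sorted)              # common reference distribution
  apply(ranks, 2, function(r) ref[r])     # map each array's ranks onto it
}

set.seed(1)
p <- 200                                  # probes
n_train <- 40; n_test <- 40               # arrays per set
y_train <- rep(c(0, 1), each = n_train / 2)
y_test  <- rep(c(0, 1), each = n_test / 2)

## class signal: first 20 probes up-shifted in class 1
add_signal <- function(n, y) {
  s <- matrix(0, p, n); s[1:20, y == 1] <- 0.8; s
}
x_train <- matrix(rnorm(p * n_train), p, n_train) + add_signal(n_train, y_train)
x_test  <- matrix(rnorm(p * n_test),  p, n_test)  + add_signal(n_test,  y_test)

## location-shift handling effect confounded with class 0 in the training set
x_train[, y_train == 0] <- x_train[, y_train == 0] + 1.5

## normalize training and test arrays separately
x_train_qn <- quantile_normalize(x_train)
x_test_qn  <- quantile_normalize(x_test)
```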

Highlights

  • Data artifacts due to variations in experimental handling are ubiquitous in microarray studies, and they can lead to biased and irreproducible findings

  • We have previously shown that cross-validation is prone to biased estimation of prediction accuracy when handling effects are pronounced in the data being analyzed, despite the use of quantile normalization (Qin, Huang & Begg, 2016); we therefore used external validation as the primary approach for assessing classification accuracy when evaluating the impact of normalization methods

  • Comparison of normalization methods in the presence of confounding handling effects: here we focus on the results of the simulation study using prediction analysis of microarrays (PAM) as the classification method and external validation for assessing the misclassification error rate (a minimal illustration follows this list)
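As a rough stand-in for the PAM-plus-external-validation workflow described above, the sketch below fits a plain nearest-centroid rule (the un-shrunken core of PAM; the pamr package implements the full method) to the quantile-normalized training data from the sketch following the Abstract, and reports the misclassification error rate on the separately normalized test set. All object names continue that hypothetical example and are not the authors' code.

```r
## Continues the simulated objects (x_train_qn, y_train, x_test_qn, y_test)
## from the previous sketch. A plain nearest-centroid rule stands in for PAM
## (PAM adds centroid shrinkage; see the pamr package for the real method).

nearest_centroid_train <- function(x, y) {
  ## one centroid (probe-wise mean) per class
  sapply(sort(unique(y)), function(k) rowMeans(x[, y == k, drop = FALSE]))
}

nearest_centroid_predict <- function(centroids, newx) {
  classes <- seq_len(ncol(centroids)) - 1          # classes coded 0, 1, ...
  apply(newx, 2, function(v) {
    d <- colSums((centroids - v)^2)                # squared distance to each centroid
    classes[which.min(d)]
  })
}

centroids <- nearest_centroid_train(x_train_qn, y_train)
pred      <- nearest_centroid_predict(centroids, x_test_qn)

## external-validation misclassification error rate
mean(pred != y_test)
```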


Summary

Introduction

Data artifacts due to variations in experimental handling are ubiquitous in microarray studies, and they can lead to biased and irreproducible findings. Statistical methods for data normalization have been developed and evaluated primarily for the discovery of individual molecular biomarkers. Their performance has rarely been studied for the development of multi-marker molecular classifiers, an increasingly important application of microarrays in the era of personalized medicine. In this study, we set out to evaluate the performance of three commonly used methods for data normalization in the context of molecular classification, using extensive simulations based on re-sampling from a unique pair of microRNA microarray datasets for the same set of samples. We have previously shown that cross-validation is prone to biased estimation of prediction accuracy when handling effects are pronounced in the data being analyzed, despite the use of quantile normalization (Qin, Huang & Begg, 2016); we therefore used external validation as the primary approach for assessing classification accuracy when evaluating the impact of normalization methods.
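Because the study contrasts cross-validation with external validation, the following sketch, which continues the hypothetical objects and helper functions from the earlier snippets, computes a k-fold cross-validated error on the training set alongside the error on the independent test set. It only illustrates the two evaluation schemes; it is not the paper's simulation design.

```r
## Schematic contrast between cross-validated and externally validated error,
## continuing the simulated objects and helper functions defined above.

cv_error <- function(x, y, k = 5) {
  n     <- ncol(x)
  folds <- sample(rep(seq_len(k), length.out = n))
  errs  <- sapply(seq_len(k), function(f) {
    tr   <- folds != f
    cen  <- nearest_centroid_train(x[, tr, drop = FALSE], y[tr])
    pred <- nearest_centroid_predict(cen, x[, !tr, drop = FALSE])
    mean(pred != y[!tr])
  })
  mean(errs)
}

set.seed(2)
cv_err  <- cv_error(x_train_qn, y_train)   # internal (cross-validated) estimate
ext_err <- mean(nearest_centroid_predict(
  nearest_centroid_train(x_train_qn, y_train), x_test_qn) != y_test)

## Compare the two estimates; in the paper, cross-validation tended to be
## over-optimistic relative to external validation when handling effects
## were confounded with class.
c(cross_validated = cv_err, external = ext_err)
```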

