Abstract

BackgroundMicroarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. The studies about the performance comparison of different algorithms are still incomprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used.ResultsIn this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether the datasets from different species have different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of a dataset but not on the species where the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets.ConclusionsIn this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.

Highlights

  • Microarray data are usually peppered with missing values due to various reasons

  • We used normalized root mean squared error (NRMSE), and cluster pair proportions (CPP) and biomarker list concordance index (BLCI) to evaluate the performance of each algorithm

  • Simulation setting In our numerical experiments, thirteen real microarray datasets were used as benchmark datasets and nine algorithms including k nearest neighbors (KNN), sequential k nearest neighbors (SKNN), iterative k nearest neighbors (IKNN), LS, local least squares (LLS), iterative local-least-squares (ILLS), sequential local-leastsquares (SLLS), Bayesian principal component analysis (BPCA) and singular value decomposition (SVD) were used

Read more

Summary

Introduction

Most of the downstream analyses for microarray data require complete datasets. Gene expression microarray (DNA chip) technology is a powerful tool for modern biomedical research. It could monitor relative expression of thousands of genes under a variety of experimental conditions. The occurrence of missing values in microarray data disadvantageously influences downstream analyses, such as discovery of differentially expressed genes [11,12], construction of gene regulatory networks [13,14], supervised classification of clinical samples [15], gene cluster analysis [10,16], and biomarker detection

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.