Abstract

Background: Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples.

Results: We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference datasets into consideration. To determine whether the given reference datasets are sufficiently informative for integration, we use a submatrix imputation approach. In our benchmark tests, iMISS significantly and consistently improved the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15%.

Conclusion: We demonstrated that order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
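The idea of a consistent neighbor-gene list can be illustrated with a small sketch. This is not the authors' implementation: it assumes every dataset is a genes x samples NumPy array with genes in the same row order, and it replaces the order-statistics test named in the abstract with a simpler stand-in (averaging each gene's normalized distance rank across datasets). The function names `neighbor_ranks` and `consistent_neighbors` are hypothetical.

```python
import numpy as np

def neighbor_ranks(data, target):
    """Rank every gene by distance to the target gene within one dataset.

    Returns rank ratios in (0, 1]; a smaller value means a closer neighbor.
    Missing entries (NaN) are ignored when computing distances.
    """
    # Root-mean-square difference over the columns observed in both genes.
    dist = np.sqrt(np.nanmean((data - data[target]) ** 2, axis=1))
    dist[target] = np.inf  # the target is never its own neighbor
    order = np.argsort(dist)
    ranks = np.empty(len(dist))
    ranks[order] = np.arange(1, len(dist) + 1)
    return ranks / len(dist)

def consistent_neighbors(datasets, target, k):
    """Pick the k genes that rank closest to the target on average.

    `datasets` holds the target dataset plus any reference datasets.
    Averaging rank ratios is a simplification assumed for this sketch,
    not the order-statistics procedure used by iMISS.
    """
    mean_rank = np.mean([neighbor_ranks(d, target) for d in datasets], axis=0)
    mean_rank[target] = np.inf
    return np.argsort(mean_rank)[:k]

# Toy data: gene 0 has a missing value in the target dataset.
target_data = np.array([[0.0, 0.0, np.nan],
                        [0.1, 0.1, 0.0],
                        [0.2, 0.2, 0.0],
                        [5.0, 5.0, 5.0]])
ref_a = np.array([[0.0, 0.0], [3.0, 3.0], [0.1, 0.1], [9.0, 9.0]])
ref_b = np.array([[1.0, 1.0], [4.0, 4.0], [1.0, 1.0], [7.0, 7.0]])

alone = consistent_neighbors([target_data], target=0, k=2)
integrated = consistent_neighbors([target_data, ref_a, ref_b], target=0, k=2)
```

In this toy case the two reference datasets agree that gene 2 tracks gene 0 more closely than gene 1 does, so the integrated list reorders the neighbors that the target dataset alone would choose.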

Highlights

  • Missing value estimation is an important preprocessing step in microarray analysis

  • The first group is composed of three datasets selected to represent diverse dataset types with the consideration that the integrative Missing Value Estimation (iMISS) approach is most useful for datasets with a small number of samples

  • Four have been used in previous missing value estimation studies: DER7, SP.ELU14, and OGA8 were used in GOImpute [13] and DER7, SP.ELU14, and SP.ALPHA18 were used in K-nearest neighbor (KNN) imputation [6]


Summary

Introduction

Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. Existing imputation algorithms can be classified into three categories: global approaches, local approaches, and hybrid approaches, which combine the previous two [12]. Global imputation algorithms such as singular value decomposition (SVDimpute) [6] and Bayesian principal components analysis (BPCA) [8] assume the existence of a covariance structure among all the genes or samples in the data matrix and are only suitable for datasets with strong global correlation, such as time-series datasets [8]. Many microarray datasets, however, are non-time-series or noisy. For these types of datasets, local imputation algorithms such as K-nearest neighbor (KNN) [6], least square (LSImpute) [3], local least square (LLS) [4], collateral missing value estimation (CMVE) [9], and Gaussian mixture clustering (GMCImpute) have been shown to be more suitable, as they can exploit the dominant local similarity structure.
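To make the local approach concrete, the following is a minimal sketch of KNN-style imputation, assuming a genes x samples NumPy matrix with missing entries encoded as NaN. The function name `knn_impute`, the inverse-distance weighting, and the default `k` are assumptions made for illustration and are not taken from the cited KNN imputation paper [6].

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in a genes x samples matrix from the k most similar genes."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_genes = X.shape[0]
    for i in range(n_genes):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        observed = ~missing
        candidates = []
        for j in range(n_genes):
            if j == i:
                continue
            # A usable neighbor is observed wherever gene i is missing
            # and shares at least one observed column with gene i.
            shared = observed & ~np.isnan(X[j])
            if np.isnan(X[j, missing]).any() or not shared.any():
                continue
            dist = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
            candidates.append((dist, j))
        if not candidates:
            continue  # no usable neighbor: leave the entries as NaN
        candidates.sort()
        nearest = candidates[:k]
        # Closer neighbors get larger weights (inverse-distance weighting).
        weights = np.array([1.0 / (d + 1e-8) for d, _ in nearest])
        rows = np.array([X[j, missing] for _, j in nearest])
        filled[i, missing] = weights @ rows / weights.sum()
    return filled

# Toy example: gene 0 is missing its last sample.
expr = np.array([[1.0, 2.0, 3.0, np.nan],
                 [1.0, 2.0, 3.0, 4.0],
                 [1.0, 2.0, 3.0, 4.0],
                 [10.0, 20.0, 30.0, 40.0]])
imputed = knn_impute(expr, k=2)
```

Because the estimate depends only on a handful of similar genes, this kind of method does not require a global covariance structure, which is why local approaches suit noisy or non-time-series data.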
