Abstract

Background

Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes.

Results and conclusions

We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods, such as replacing the missing values with the mean or the median value, performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0494-3) contains supplementary material, which is available to authorized users.
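The two simple strategies named in the abstract (filling a missing value with the mean or the median) and the root mean squared error used to score imputation accuracy can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `impute_rowwise` and `rmse`, the convention that rows are genes and columns are samples, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def impute_rowwise(expr, strategy="mean"):
    """Fill NaNs in each row with that row's mean or median of observed values.

    Assumption: rows are genes, columns are samples. A row with no observed
    values stays NaN (NumPy emits a warning in that case).
    """
    expr = np.asarray(expr, dtype=float)
    if strategy == "mean":
        fill = np.nanmean(expr, axis=1)
    elif strategy == "median":
        fill = np.nanmedian(expr, axis=1)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    imputed = expr.copy()
    rows, cols = np.where(np.isnan(imputed))
    imputed[rows, cols] = fill[rows]
    return imputed


def rmse(true_values, imputed_values, mask):
    """Root mean squared error over the entries that were masked out."""
    diff = true_values[mask] - imputed_values[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Hypothetical toy data: hide two known entries, impute them, score the result.
complete = np.array([[1.0, 2.0, 3.0, 6.0],
                     [4.0, 4.0, 5.0, 7.0]])
mask = np.zeros_like(complete, dtype=bool)
mask[0, 3] = True
mask[1, 0] = True

observed = complete.copy()
observed[mask] = np.nan

mean_imputed = impute_rowwise(observed, "mean")
median_imputed = impute_rowwise(observed, "median")

print("RMSE (mean):  ", rmse(complete, mean_imputed, mask))
print("RMSE (median):", rmse(complete, median_imputed, mask))
```

Scoring against artificially hidden entries only measures imputation accuracy. As the abstract points out, it says nothing by itself about how the imputed matrix behaves in downstream clustering or classification.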

Highlights

  • Several missing value imputation methods for gene expression data have been proposed in the literature

  • As with many types of experimental data, gene expression data obtained from microarray experiments often contain missing values (MVs) [2,3,4,5]

  • In the context of gene expression data, MV imputation methods usually fall into two categories [4]


Summary

Introduction

Several missing value imputation methods for gene expression data have been proposed in the literature. Different technologies can be used to measure the expression level of a gene. One of the most important is microarray technology, which allows the simultaneous measurement of the expression levels of thousands of genes [1]. As with many types of experimental data, gene expression data obtained from microarray experiments often contain missing values (MVs) [2,3,4,5]. This can occur for several reasons: insufficient resolution, image corruption, fabrication errors, poor hybridization, or contaminants due to dust or scratches on the chip. Many standard methods for gene expression data analysis, including some classification and clustering techniques, require a

