Abstract

Background: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of genomics data. Current imputation methods based on a single learner often exploit too little of the known genomic data information, which degrades imputation performance.

Results: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied when each component method predicts the missing values, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and the expression for the optimal weights is derived in closed form. Additionally, the performance of the ensemble method is investigated analytically in terms of the sum of squared regression errors. The proposed method is evaluated in simulations on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and data missing rates. Experimental results show that the proposed method achieves improved imputation performance in terms of imputation accuracy, robustness and generalization.

Conclusion: The ensemble method achieves superior imputation performance because it uses known data information more efficiently for missing data imputation, by integrating diverse imputation methods and learning the integration weights in a data-driven way.
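The weighting step described in the abstract can be illustrated with a small sketch. The abstract does not state the cost function, so this example assumes ordinary least squares on held-out known entries, for which the closed-form solution is w = (PᵀP)⁻¹Pᵀy, with P holding the component predictions and y the true held-out values. The two mean-based component imputers, the function names and the `holdout_frac` parameter are all hypothetical placeholders, and the bootstrap sampling step is omitted for brevity. It is not the authors' implementation.

```python
import numpy as np

def mean_impute(X, mask, axis):
    """Placeholder component imputer: fill masked entries with the row
    (axis=1) or column (axis=0) mean of the remaining observed values."""
    obs = np.where(mask, np.nan, X)
    means = np.nanmean(obs, axis=axis)
    # Fall back to the global mean if a whole row/column is masked.
    means = np.where(np.isnan(means), np.nanmean(obs), means)
    rows, cols = np.nonzero(mask)
    filled = X.copy()
    filled[rows, cols] = means[rows] if axis == 1 else means[cols]
    return filled

# Hypothetical component methods; a real ensemble would use stronger imputers.
imputers = [
    lambda X, m: mean_impute(X, m, axis=1),
    lambda X, m: mean_impute(X, m, axis=0),
]

def learn_weights(X, missing, imputers, holdout_frac=0.2, seed=0):
    """Hide some known entries, let each component predict them, and solve
    the least-squares problem min_w ||P w - y||^2 (assumed cost function)."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(~missing)
    n_hold = max(len(imputers), int(holdout_frac * len(known)))
    pick = known[rng.choice(len(known), size=n_hold, replace=False)]
    holdout = np.zeros_like(missing)
    holdout[pick[:, 0], pick[:, 1]] = True
    train_mask = missing | holdout
    # One column of predictions per component method.
    P = np.column_stack([imp(X, train_mask)[holdout] for imp in imputers])
    y = X[holdout]
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    return w

def ensemble_impute(X, missing, imputers, **kwargs):
    """Impute missing entries as the weighted sum of component predictions."""
    w = learn_weights(X, missing, imputers, **kwargs)
    preds = np.column_stack([imp(X, missing)[missing] for imp in imputers])
    out = X.copy()
    out[missing] = preds @ w
    return out, w
```

Calling `ensemble_impute(X, missing, imputers)` with a data matrix `X` (NaN at missing positions) and a boolean `missing` mask returns the completed matrix and the learned weights. Because the weights are fitted on entries whose true values are known, the combination adapts to whichever component predicts that dataset best.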

Highlights

  • Genomics data analysis has been widely used to study disease genes and drug targets

  • We develop an ensemble method for missing value imputation

  • The ensemble method imputes missing values by constructing a set of base imputation methods and combining them


Summary

Introduction

Zhu et al. BMC Bioinformatics (2021) 22:188

Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of genomics data. In particular, missing values greatly hinder the use of gene data and the mining of effective gene information [6,7,8,9]. One option for further analysis is to ignore the rows or columns of the gene data matrix that contain missing entries, but this results in a significant loss of useful gene information. As a necessary preprocessing step, missing data imputation is widely performed before analyzing microarray data.
