Abstract
Missing values in microarray data hinder downstream biological analyses that expect complete data as input. Effectively exploring the underlying structure of the data to accurately estimate missing entries therefore remains crucial. In this study, we formalize the problem under a regularized sparse framework and propose local learning-based imputation models that capture the relationships hidden in gene expression profiles. Specifically, in view of the simultaneous variable selection and grouping effect of the elastic net penalty, we present an elastic net regularized local least squares-based imputation method that estimates the missing entries of a target gene from its neighbors. In addition, we investigate different similarity filtering metrics for selecting neighbor genes and develop four further imputation methods under the framework. The proposed methods process the target genes in ascending order of their missing rates. Finally, extensive comparative experiments against eight other commonly used methods are conducted on multiple microarray datasets with varying missing rates. The results indicate the power of sparse regularization techniques and the superiority of elastic net over its competitors in terms of statistical analysis metrics.
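To make the idea concrete, the following is a minimal sketch of elastic net regularized local least squares imputation, not the authors' implementation. Several choices are assumptions made only for illustration: the function names (`soft_threshold`, `elastic_net_fit`, `impute`), Euclidean distance as the similarity filter, candidate neighbors restricted to genes without missing entries (with already imputed genes joining the pool), a fit without intercept, and the values of `k`, `alpha`, and `l1_ratio`. Genes are rows, samples are columns, and `None` marks a missing entry.

```python
import math


def soft_threshold(value, threshold):
    """Soft-thresholding operator that handles the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_fit(X, y, alpha=0.01, l1_ratio=0.5, n_iter=200):
    """Coordinate descent (no intercept) for the objective
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    X is a list of n rows with k columns; y has length n."""
    n, k = len(X), len(X[0])
    w = [0.0] * k
    residual = list(y)  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(k)]
    for _ in range(n_iter):
        for j in range(k):
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            denom = col_sq[j] + alpha * (1.0 - l1_ratio)
            new_wj = soft_threshold(rho, alpha * l1_ratio) / denom if denom > 0 else 0.0
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


def impute(data, k=2, alpha=0.01, l1_ratio=0.5):
    """Return a copy of `data` with missing entries (None) estimated."""
    filled = [list(row) for row in data]
    n_cols = len(data[0])
    complete = [g for g, row in enumerate(data) if None not in row]
    incomplete = [g for g, row in enumerate(data) if None in row]
    # Process target genes in ascending order of their missing rate.
    incomplete.sort(key=lambda g: sum(v is None for v in data[g]))
    for g in incomplete:
        obs = [c for c in range(n_cols) if data[g][c] is not None]
        mis = [c for c in range(n_cols) if data[g][c] is None]
        if not obs or not complete:
            continue  # nothing to learn from; leave the entries missing

        def dist(h):
            # Euclidean distance over the target's observed columns (assumed metric).
            return math.sqrt(sum((filled[g][c] - filled[h][c]) ** 2 for c in obs))

        neighbors = sorted(complete, key=dist)[:k]
        # Regress the target's observed values on the neighbors' values.
        X = [[filled[h][c] for h in neighbors] for c in obs]
        y = [filled[g][c] for c in obs]
        w = elastic_net_fit(X, y, alpha, l1_ratio)
        for c in mis:
            filled[g][c] = sum(wj * filled[h][c] for wj, h in zip(w, neighbors))
        complete.append(g)  # assumption: imputed genes may serve as neighbors
    return filled


if __name__ == "__main__":
    toy = [
        [1.0, 2.0, 3.0, 4.0, 5.0],
        [2.0, 1.0, 0.5, 3.0, 1.0],
        [2.0, 4.0, 6.0, None, 10.0],  # twice the first gene; true value is 8.0
    ]
    print(impute(toy)[2])
```

In this toy matrix the third gene is an exact multiple of the first, so the L1 term can drop the unrelated second neighbor while the small L2 term only slightly shrinks the remaining coefficient. Real expression data would normally call for tuning `k`, `alpha`, and `l1_ratio`, for example by cross-validation on artificially masked entries.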
Highlights
DNA microarray technology provides researchers with a high-throughput way to efficiently obtain the gene expression levels of a certain disease from different environments, subjects, tissues, and cell cycles, and microarray data analysis greatly facilitates the identification of disease genes and the diagnosis of cancers and tumor subtypes [1], [2].
(1) We present a regularized sparse framework to impute missing entries of microarray data and propose an elastic net regularized local least squares-based imputation method to capture the relationships between a target gene and its neighbors.
Accurately estimating the missing values of microarray data plays a crucial role in fully utilizing a collection of gene expression profiles and facilitating downstream analyses; it remains a challenging yet rewarding research topic.
Summary
DNA microarray technology provides researchers with a high-throughput way to efficiently obtain the gene expression levels of a certain disease from different environments, subjects, tissues, and cell cycles, and microarray data analysis greatly facilitates the identification of disease genes and the diagnosis of cancers and tumor subtypes [1], [2]. In contrast to the above methods, global learning-based methods adopt a data-driven strategy to estimate missing values under the assumption that a covariance structure exists in the obtained microarray dataset [14]. Such methods generally perform well on large gene expression profiles, but their performance degrades when the global covariance structure does not exist or local structures dominate [15]. (1) We present a regularized sparse framework to impute missing entries of microarray data and propose an elastic net regularized local least squares-based imputation method to capture the relationships between a target gene and its neighbors. This helps exploit the latent local structure of the data and reduces the risk of overfitting.