Abstract

Gene expression microarrays are the most commonly available source of high-throughput biological data. They are widely employed for studying many different aspects of gene regulation and function, ranging from understanding the global cell-cycle control of microorganisms to cancer in humans. Gene expression microarray experiments often generate data sets with multiple missing values. Many algorithms for gene expression data analysis require a complete data matrix and therefore, the accurate estimation of missing entries is crucial for their optimal usage. The latter has driven the development of various microarray imputation methods. However, most of these approaches are not particularly suitable for time series expression profiles. Moreover, their performance is not satisfactory for datasets with high rates of missing data or small numbers of samples. Another drawback of all these methods is that their estimation is based solely on a single expression matrix and no other additional data sources to impute the missing entries are used. Motivated by these, we propose herein an imputation algorithm that is particularly suited for the estimation of missing values in gene expression time series data using information that is contained in multiple related data sets. The proposed algorithm initially identifies an appropriate set of estimation matrices by using the Dynamic Time Warping (DTW) distance in order to measure similarities between gene expression matrices. Next it employs the same distance measure to evaluate the similarity between gene expression profiles and further applies a hybrid aggregation algorithm to combine the inter-gene similarities across the selected matrices in order to identify estimation genes. Then the expression profiles of those estimation genes are used to obtain the final imputation. The estimation accuracy of the proposed algorithm, called Integrative DTW-based Imputation (IDTWimpute), is benchmarked against that of two other imputation methods (KNNimpute and DTWimpute) in terms of root mean squared difference. In addition, the impact of the three methods on the quality of gene clustering is evaluated by using k-means and k-medoids clustering algorithms and two different cluster validation measures.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call