Improved k-nearest neighbors approach for incomplete and contaminated gene expression datasets

H Akter, M Shahjaman, Mi Asifuzzaman, Mm Rashid, Mnh Mollah, Sms Islam

doi:10.3329/jbs.v27i0.44669

Abstract

With the rapid development of high-throughput DNA microarray technologies, researchers can measure the expression profiles of thousands of genes simultaneously at low cost. These massive gene expression (GE) datasets often contain missing values or outliers arising from various steps of the data-generating process. Most statistical methods were developed for complete datasets, so their performance degrades sharply when they are applied to incomplete data, and the subsequent analysis can fail to reach its objective. Numerous methods for imputing missing values are available in the literature. Although missing value imputation and outlier handling are equally important for analyzing GE data, most methods perform these tasks separately and can produce misleading results. Therefore, in this paper we develop a new hybrid approach that is robust against outliers and missing values simultaneously. We compare the performance of the proposed method with that of the popular missing value imputation method K-NN in feature selection, using both simulated and real GE datasets. The results obtained from the simulated and real data studies show that the proposed method outperforms K-NN in the presence of different percentages of missing values and outliers. In the absence of outliers, when only missing values are present, the proposed method performs on par with the other methods. J. bio-sci. 27: 31-41, 2019
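
For orientation, the baseline that the abstract compares against, K-NN imputation, can be sketched as follows. This is a minimal illustration and not the authors' proposed hybrid method, whose details the abstract does not give. The function name `knn_impute`, the genes-by-samples layout, the root-mean-square distance over co-observed samples, and the `use_median` switch are all assumptions made for this example. The median option is included only to show why a contaminated donor gene distorts a mean-based fill.

```python
import numpy as np

def knn_impute(X, k=3, use_median=False):
    """Fill NaNs in a genes-by-samples matrix from the k most similar genes.

    Hypothetical helper for illustration; not the method proposed in the paper.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)
    # Visit only the genes (rows) that have at least one missing value.
    for i in np.where(~observed.all(axis=1))[0]:
        # Distance to every other gene, using only samples observed in both rows.
        dists = np.full(X.shape[0], np.inf)
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = observed[i] & observed[j]
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for col in np.where(~observed[i])[0]:
            # Keep the k closest genes that actually have a value in this sample.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and observed[j, col]][:k]
            if donors:
                vals = X[donors, col]
                filled[i, col] = np.median(vals) if use_median else np.mean(vals)
    return filled

# Toy matrix: gene 1 is missing its third sample; gene 3 acts as an outlying donor.
toy = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.2],
    [10.0, 20.0, 30.0],
])
mean_fill = knn_impute(toy, k=3)[1, 2]
median_fill = knn_impute(toy, k=3, use_median=True)[1, 2]
```

With `k=3` the outlying gene is forced into the donor set, so the mean-based fill is pulled far from the values of the two similar genes while the median-based fill is not. This is the kind of sensitivity to contamination that motivates handling missing values and outliers together.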
