Abstract
DNA microarray technology produces large gene expression data sets, and these data sets commonly contain missing expression values. In this paper, we present a genetic algorithm optimized k-Nearest Neighbor algorithm (Evolutionary kNNImputation) for missing data imputation. In contrast to the common imputation methods, this paper addresses the effectiveness of using supervised learning algorithms for missing data imputation. Missing data imputation approaches can be grouped into four main categories; our focus is mainly on the local approach, into which the proposed Evolutionary k-Nearest Neighbor Imputation algorithm falls. The Evolutionary k-Nearest Neighbor Imputation algorithm extends the common k-Nearest Neighbor Imputation algorithm by using a genetic algorithm to optimize some of its parameters. The selection of the similarity matrix and the selection of the parameter value k are treated as the optimization problem. We compared the proposed Evolutionary k-Nearest Neighbor Imputation algorithm with the k-Nearest Neighbor Imputation algorithm and the mean imputation method. The three algorithms were tested on gene expression datasets: certain percentages of values were randomly deleted from the datasets, and the missing values were then recovered using each of the three algorithms. Results show that Evolutionary kNNImputation outperforms kNNImputation and mean imputation, demonstrating the importance of using a supervised learning algorithm in missing data estimation. Although mean imputation showed a low mean error at a few missing rates, the supervised learning algorithms were more effective at higher missing rates, which is the most common situation among datasets.
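The evaluation protocol described in the abstract (delete a percentage of values at random, impute them, and measure the error on the deleted entries) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The synthetic data, the function names, the root mean squared error measure, and the plain Euclidean-style distance are all assumptions introduced here, and only mean imputation and ordinary kNN imputation are shown.

```python
import numpy as np

def mask_random(data, rate, rng):
    """Return a copy of `data` with roughly `rate` of its entries set to NaN,
    together with the boolean mask of deleted entries."""
    mask = rng.random(data.shape) < rate
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask

def mean_impute(data):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(data, axis=0)
    filled = data.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

def knn_impute(data, k):
    """Replace each NaN with the mean of that column over the k nearest rows.
    Distance (an assumption) is the root mean squared difference over the
    columns observed in both rows."""
    filled = data.copy()
    col_means = np.nanmean(data, axis=0)  # fallback when no neighbor is usable
    for i in np.where(np.isnan(data).any(axis=1))[0]:
        diff = data - data[i]  # NaN wherever either row is missing a value
        n_shared = (~np.isnan(diff)).sum(axis=1)
        sq_sum = np.nansum(diff ** 2, axis=1)
        dist = np.full(len(data), np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt(sq_sum[ok] / n_shared[ok])
        dist[i] = np.inf  # a row is never its own neighbor
        for j in np.where(np.isnan(data[i]))[0]:
            # Candidate neighbors must have column j observed.
            cand = np.where(~np.isnan(data[:, j]) & np.isfinite(dist))[0]
            if cand.size == 0:
                filled[i, j] = col_means[j]
                continue
            nearest = cand[np.argsort(dist[cand])[:k]]
            filled[i, j] = data[nearest, j].mean()
    return filled

def rmse_on_mask(truth, filled, mask):
    """Root mean squared error measured only on the deleted entries."""
    return float(np.sqrt(np.mean((truth[mask] - filled[mask]) ** 2)))

# Synthetic stand-in for an expression matrix (rows = genes, columns = samples).
rng = np.random.default_rng(0)
truth = rng.normal(size=(60, 1)) + 0.1 * rng.normal(size=(60, 8))
masked, mask = mask_random(truth, rate=0.10, rng=rng)

mean_filled = mean_impute(masked)
knn_filled = knn_impute(masked, k=5)
print("mean imputation RMSE:", rmse_on_mask(truth, mean_filled, mask))
print("kNN imputation RMSE: ", rmse_on_mask(truth, knn_filled, mask))
```

Repeating the last block for several values of `rate` gives the kind of per-missing-rate comparison the abstract reports. The numbers from this synthetic matrix say nothing about the paper's results.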
Highlights
DNA microarray technology is widely used to analyze gene expression data
We present a genetic algorithm optimized k-Nearest Neighbor algorithm for imputing missing data and compare it with the k-Nearest Neighbor Imputation algorithm and several other common imputation methods
The training dataset should not have missing values, because the Evolutionary k-Nearest Neighbor algorithm runs an optimization process before imputation in which a weight is assigned to each attribute of the dataset according to its importance in predicting the missing value
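The optimization step in the last highlight can be illustrated with a small genetic algorithm that searches for per-attribute weights and a value of k on a complete training set. This is a rough sketch under stated assumptions. The chromosome encoding, truncation selection, uniform crossover, Gaussian mutation, leave-one-out fitness, and every function name below are hypothetical choices made for illustration, since this page does not describe the paper's actual operators or fitness function.

```python
import numpy as np

def weighted_knn_predict(train, query, target, weights, k):
    """Predict column `target` of `query` as the mean over the k nearest
    training rows, using a weighted Euclidean distance on the other columns."""
    feats = [c for c in range(train.shape[1]) if c != target]
    diff = train[:, feats] - query[feats]
    dist = np.sqrt((weights[feats] * diff ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    return train[nearest, target].mean()

def fitness(train, target, weights, k):
    """Leave-one-out RMSE on the complete training set (lower is better)."""
    errors = []
    for i in range(len(train)):
        rest = np.delete(train, i, axis=0)
        pred = weighted_knn_predict(rest, train[i], target, weights, k)
        errors.append((pred - train[i, target]) ** 2)
    return float(np.sqrt(np.mean(errors)))

def evolve(train, target, rng, pop_size=12, generations=10, k_max=10):
    """Search for (weights, k) with a simple GA; all settings are assumptions."""
    n_cols = train.shape[1]
    pop = [(rng.random(n_cols), int(rng.integers(1, k_max + 1)))
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda ind: fitness(train, target, *ind))
        parents = scored[: pop_size // 2]  # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), size=2, replace=False)
            (wa, ka), (wb, kb) = parents[a], parents[b]
            # Uniform crossover on the weights, then Gaussian mutation.
            pick = rng.random(n_cols) < 0.5
            w = np.clip(np.where(pick, wa, wb) + rng.normal(0, 0.1, n_cols), 0.0, 1.0)
            # Inherit k from one parent and nudge it by -1, 0, or +1.
            k = (ka if rng.random() < 0.5 else kb) + int(rng.integers(-1, 2))
            children.append((w, int(np.clip(k, 1, k_max))))
        pop = parents + children
    return min(pop, key=lambda ind: fitness(train, target, *ind))

# Complete synthetic training set: column 1 tracks column 0, columns 2-3 are noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=40)
train = np.column_stack([signal,
                         signal + 0.1 * rng.normal(size=40),
                         rng.normal(size=40),
                         rng.normal(size=40)])

best_w, best_k = evolve(train, target=0, rng=rng)
print("weights:", np.round(best_w, 2), "k:", best_k)
```

The fitness function hides each known value in turn and predicts it from the remaining rows, which is why the training set has to be complete: every hidden value needs a ground truth to score against. The weight stored at the target column's own position is never used by the distance.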
Summary
DNA microarray technology is widely used to analyze gene expression data. These expression data sets are large and frequently contain missing values. Given the expense of collecting data, we cannot afford to start over or to wait until we develop foolproof methods of gathering information [3]. Because repeating the process is very time consuming and expensive, scientists are turning to missing data imputation as a solution [4]. This paper discusses the importance of using a machine learning algorithm, as most common imputation methods, such as case deletion and mean imputation, give less effective results because they do not consider the correlation of data.
More From: International Journal on Advances in ICT for Emerging Regions (ICTer)