Evaluating a Nearest-Neighbor Method to Substitute Continuous Missing Values

Eduardo R Hruschka,Nelson F F Ebecken,Estevam R Hruschka

doi:10.1007/978-3-540-24581-0_62


Open Access

https://doi.org/10.1007/978-3-540-24581-0_62


Abstract

This work proposes and evaluates a Nearest-Neighbor Method to substitute missing values in datasets formed by continuous attributes. In the substitution process, each instance containing missing values is compared with the complete instances, and the closest complete instance supplies the value for each missing attribute. We evaluate this method in simulations performed on four datasets that are commonly used as benchmarks for data mining methods: Iris Plants, Wisconsin Breast Cancer, Pima Indians Diabetes and Wine Recognition. First, we consider the substitution process as a prediction task. In this sense, we employ two metrics (Euclidean and Manhattan) to simulate substitutions in both the original and the normalized datasets. The results are compared to those provided by a method commonly used for this task, i.e., substitution by the mean value. Based on these simulations, we propose a substitution procedure for the well-known K-Means Clustering Algorithm. Then, we perform clustering simulations, comparing the results obtained in the original datasets with those obtained in the substituted ones. These results indicate that the proposed method is a suitable estimator for substituting missing values, i.e., it preserves the relationships between variables in the clustering process. Therefore, the proposed Nearest-Neighbor Method is an appropriate data preparation tool for the K-Means Clustering Algorithm.
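The substitution step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `nn_impute` and `mean_impute` are hypothetical, missing values are assumed to be encoded as NaN, and distances are assumed to be computed only over the attributes observed in the incomplete instance, a detail the abstract does not specify. Normalization, which the abstract mentions as an optional preprocessing step, is left out.

```python
import math


def nn_impute(rows, metric="euclidean"):
    """Fill NaN entries of each row from its closest complete row.

    Assumption (not stated in the abstract): the distance uses only the
    attributes that are observed in the incomplete row.
    """
    complete = [r for r in rows if not any(math.isnan(v) for v in r)]
    if not complete:
        raise ValueError("at least one complete instance is required")

    def distance(a, b, idx):
        if metric == "euclidean":
            return math.sqrt(sum((a[j] - b[j]) ** 2 for j in idx))
        if metric == "manhattan":
            return sum(abs(a[j] - b[j]) for j in idx)
        raise ValueError(f"unknown metric: {metric}")

    filled = []
    for r in rows:
        observed = [j for j, v in enumerate(r) if not math.isnan(v)]
        if len(observed) == len(r):
            filled.append(list(r))  # nothing to substitute
            continue
        # Closest complete instance acts as the donor of the missing values.
        donor = min(complete, key=lambda c: distance(r, c, observed))
        filled.append([donor[j] if math.isnan(v) else v
                       for j, v in enumerate(r)])
    return filled


def mean_impute(rows):
    """Baseline: replace each NaN with the mean of its column's observed values."""
    means = []
    for j in range(len(rows[0])):
        vals = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(vals) / len(vals) if vals else math.nan)
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]


# Toy data (invented for illustration, not taken from the benchmark datasets).
data = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [1.1, math.nan, 2.9],
]
nn_result = nn_impute(data)
mean_result = mean_impute(data)
```

In this toy example the incomplete instance is closest to the first complete instance, so the nearest-neighbor substitution takes the value of a similar instance, whereas the mean substitution assigns the same column average to every incomplete instance regardless of its other attributes.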

Full Text