A novel centroid initialization in missing value imputation towards mixed datasets

Titin Siswantining, Devvi Sarwinda, Taufik Anwar, Herley Shaori Al-Ash

doi:10.28919/cmbn/5344

Open Access

Abstract

Many databases contain missing values, and medical data are especially affected. Statistical and data mining approaches often require complete data and do not perform adequately when values are missing. Several techniques have been developed to handle missing values, one of which is deleting the records that contain them. However, this approach discards a lot of information when many records have missing values. This study instead uses imputation (filling in the missing attributes) based on clustering. One of the most common clustering approaches is K-Means clustering, in which each centroid value is obtained from the closest observed values. In this study, we propose updating the centroid value based on the harmonic average of the distances across all observations per centroid, a method known as K-Harmonic Means clustering (KHM). We propose a new approach for mixed datasets and evaluate it under three scenarios with 10%, 20%, and 30% missing values. In experiments on data sets containing missing values, a small proportion of missing values (10%) combined with a small number of clusters K gives a smaller RMSE than the other scenarios.
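
To make the centroid update described in the abstract concrete, the following is a minimal sketch of K-Harmonic Means imputation. It is not the authors' implementation: it assumes numeric features only (the mixed-data handling is not described in the abstract), uses the standard KHM update with a distance exponent `p`, fits centroids on complete rows, and fills each missing entry from the nearest centroid measured over the observed features. The function names `khm_update`, `khm_fit`, `impute_from_centroids`, and `rmse` are hypothetical.

```python
import numpy as np


def khm_update(X, C, p=2.0, eps=1e-8):
    """One K-Harmonic Means centroid update on complete numeric data.

    Assumption: standard KHM weights, where every observation contributes
    to every centroid (unlike K-Means hard assignment).
    """
    # d[i, j] = Euclidean distance from observation i to centroid j
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    d = np.maximum(d, eps)  # guard against division by zero
    inv_p_sum = np.sum(d ** (-p), axis=1, keepdims=True)  # shape (n, 1)
    w = d ** (-p - 2) / inv_p_sum ** 2  # weight of observation i for centroid j
    return (w.T @ X) / w.sum(axis=0)[:, None]


def khm_fit(X, k, p=2.0, n_iter=50, seed=0):
    """Run a fixed number of KHM updates from randomly chosen observations."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        C = khm_update(X, C, p)
    return C


def impute_from_centroids(X_missing, C):
    """Fill NaN entries with the values of the nearest centroid.

    Assumption: nearness is measured on the observed features of each row.
    """
    X_filled = X_missing.copy()
    for i, row in enumerate(X_missing):
        obs = ~np.isnan(row)
        if obs.all():
            continue
        dist = np.linalg.norm(C[:, obs] - row[obs], axis=1)
        X_filled[i, ~obs] = C[np.argmin(dist), ~obs]
    return X_filled


def rmse(a, b):
    """Root mean squared error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

Under these assumptions, one would call `khm_fit` on the complete rows, pass the resulting centroids to `impute_from_centroids`, and compare the imputed entries against held-out true values with `rmse` for each missing-value scenario.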
