K-Means Cluster Analysis for Missing Data

Juwon Song

doi:10.37727/jkdas.2017.19.2.689

Abstract

Cluster analysis is a technique for classifying observations with similar characteristics into the same group. K-means cluster analysis partitions the data into clusters through an optimization method that finds the grouping minimizing the sum of Euclidean distances between observations and their cluster means. In real data, missing values often occur in some variables, and when cluster analysis is applied to such data it is common to remove the observations with missing values before the analysis. In that case, however, the observations with missing values cannot be assigned to any cluster, and the estimated cluster means are likely to be biased. To keep these observations in the analysis, a common approach is to impute the missing values first and then run cluster analysis on the imputed data, but this has the disadvantage that the imputation is carried out without using cluster information. This study therefore proposes imputing missing values using cluster information. A simulation study compares the proposed method with imputation that ignores cluster information followed by cluster analysis, and the proposed imputation method gives better results.
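The abstract does not spell out the imputation procedure, so the following is only a minimal sketch of one way cluster information could be used during imputation, not the paper's method. It assumes missing entries are coded as NaN, starts from overall column means (imputation without cluster information), and then alternates between assigning observations to clusters and replacing each missing entry with the mean of its own cluster. The function name `kmeans_with_cluster_imputation`, the deterministic farthest-point initialization, and the use of squared Euclidean distance over observed coordinates are all assumptions made for illustration.

```python
import numpy as np


def kmeans_with_cluster_imputation(X, k, n_iter=100):
    """Hypothetical sketch: k-means that imputes missing entries (NaN)
    with the mean of the cluster each observation is assigned to."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Starting point: impute with overall column means (no cluster information).
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    # Deterministic farthest-point initialization, chosen only to keep the
    # example reproducible; random restarts are more common in practice.
    centers = [filled[0]]
    for _ in range(1, k):
        d = np.min([((filled - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(filled[d.argmax()])
    centers = np.array(centers)

    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance over observed coordinates only.
        diff = np.where(missing[:, None, :], 0.0,
                        filled[:, None, :] - centers[None, :, :])
        new_labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so centers and imputations are too
        labels = new_labels

        # Update step: each center is the mean of the observed values in its cluster.
        for j in range(k):
            members = labels == j
            if not members.any():
                continue  # keep the previous center for an empty cluster
            observed = ~missing[members]
            counts = observed.sum(axis=0)
            sums = np.where(observed, X[members], 0.0).sum(axis=0)
            centers[j] = np.where(counts > 0, sums / np.maximum(counts, 1), centers[j])

        # Imputation step: fill each missing entry with its own cluster's mean.
        filled = np.where(missing, centers[labels], X)

    return labels, centers, filled


# Toy data: two well-separated groups, each with one missing entry.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, np.nan],
              [10.0, 10.0], [10.2, 9.9], [np.nan, 10.1]])
labels, centers, filled = kmeans_with_cluster_imputation(X, k=2)
print(labels)
print(filled)
```

On this toy data the missing second coordinate of the third observation is filled from its own cluster (about 0.05) rather than from the overall column mean (about 6.02), which is the kind of difference the comparison in the abstract is concerned with.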
