Abstract

The k-means algorithm is one of the most popular and widely used clustering algorithms; however, it is limited to numerical data. The k-prototypes algorithm is well known for handling both numerical and categorical data, yet there have been no studies on accelerating it. In this paper, we propose a new, fast k-prototypes algorithm that produces exactly the same results as the original k-prototypes algorithm. The proposed algorithm avoids unnecessary distance computations by means of partial distance computation: it finds the minimum distance without computing the distance over all attributes between an object and a cluster center, which reduces the time complexity. Partial distance computation exploits the fact that the maximum difference between two categorical attribute values is 1, so if the data objects have m categorical attributes, the categorical part of the distance between an object and a cluster center is at most m. Our algorithm first computes distances using the numerical attributes only. If the difference between the smallest and the second-smallest of these numerical distances is greater than m, the nearest cluster center is already determined and the categorical distance computations can be skipped. The experimental results show that the computational performance of the proposed k-prototypes algorithm is superior to that of the original k-prototypes algorithm on our datasets.
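
As a rough sketch of the partial distance computation described above (not the authors' implementation; the categorical weight is assumed to be 1 and all names are illustrative), the following Python function assigns a single mixed-type object to its nearest prototype, skipping the categorical comparisons whenever the numerical distances alone already decide the assignment:

```python
import numpy as np


def matching_dissimilarity(a, b):
    """Simple matching distance over categorical attributes: 1 per mismatch,
    so with m categorical attributes the value is at most m."""
    return int(np.sum(a != b))


def assign_with_partial_distance(x_num, x_cat, centers_num, centers_cat):
    """Return the index of the nearest prototype for one object.

    Illustrative sketch only: squared Euclidean distance on the numerical
    attributes plus simple matching on the categorical ones (weight 1).
    """
    m = x_cat.shape[0]  # number of categorical attributes

    # Step 1: numerical distance to every cluster center.
    d_num = np.sum((centers_num - x_num) ** 2, axis=1)
    best = int(np.argmin(d_num))

    # A center whose numerical distance already exceeds d_num[best] + m can
    # never be the nearest one, because the categorical part adds at most m.
    candidates = np.flatnonzero(d_num <= d_num[best] + m)

    if candidates.size == 1:
        # The gap between the smallest and second-smallest numerical distance
        # is greater than m: no categorical computation is needed at all.
        return best

    # Step 2: full k-prototypes distance, but only for surviving candidates.
    d_full = {j: d_num[j] + matching_dissimilarity(x_cat, centers_cat[j])
              for j in candidates}
    return min(d_full, key=d_full.get)
```

If the categorical attributes carry a weight γ, as in the usual k-prototypes objective, the same bound applies with γ·m in place of m.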

Highlights

  • The k-means algorithm is one of the simplest unsupervised clustering algorithms and is therefore very widely used [1]

  • In the k-means algorithm, the objective function is defined as the sum of the squared distances between each object and its cluster center (written out below this list)

  • This study presents a new method for accelerating the k-prototypes algorithm using partial distance computation, which avoids unnecessary distance computations between an object and the cluster centers
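
In standard notation (a conventional rendering rather than a quotation from the paper), the k-means objective for k clusters S_1, ..., S_k with centers c_1, ..., c_k is

J = \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - c_j \rVert^2

i.e., the sum of squared Euclidean distances between each object and the center of the cluster to which it is assigned.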

Summary

Introduction

The k-means algorithm is one of the simplest unsupervised clustering algorithms and is therefore very widely used [1]. It spends much of its processing time computing the distances between each of the k cluster centers and the n objects, and many researchers have worked on accelerating k-means by avoiding unnecessary distance computations between an object and the cluster centers. For large datasets, the cost of computing the distances between all data objects and the centers is high. The proposed method preserves convergence and can be applied together with other fast k-means algorithms that compute the distance between each cluster center and an object over the numerical attributes. This study presents a new method of accelerating the k-prototypes algorithm using partial distance computation, which avoids unnecessary distance computations between an object and the cluster centers; a toy example follows below.
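
As a toy usage of the assign_with_partial_distance sketch given after the abstract (all values invented for illustration), consider three prototypes with two numerical and two categorical attributes each:

```python
import numpy as np

# Prototypes: two numerical and two categorical attributes each (made up).
centers_num = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
centers_cat = np.array([["red", "small"], ["blue", "large"], ["red", "large"]])

x_num = np.array([0.5, 0.2])
x_cat = np.array(["blue", "small"])

# The gap between the closest and second-closest numerical distances exceeds
# m = 2, so the assignment is decided without reading the categorical columns.
print(assign_with_partial_distance(x_num, x_cat, centers_num, centers_cat))  # -> 0
```

In the full algorithm this check is applied to every object in every iteration, so the savings grow with the number of objects for which the numerical distances alone are decisive.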

K-means
K-prototypes
Proposed Algorithm
Time Complexity
Experimental Results
Conclusions