A New-Fangled FES-k-Means Clustering Algorithm for Disease Discovery and Visual Analytics

Tonny J Oyana

doi:10.1155/2010/746021

Abstract

The central purpose of this study is to further evaluate the performance of a new algorithm, the Fast, Efficient, and Scalable k-means algorithm (FES-k-means), which was designed to increase the overall efficiency of the original k-means clustering technique. The FES-k-means algorithm uses a hybrid approach that combines the k-d tree data structure, which speeds up the nearest neighbor query, the original k-means algorithm, and an adaptation rate proposed by Mashor. The algorithm was tested using two real datasets and one synthetic dataset. To evaluate its competence, it was applied twice to all three datasets: once to data trained by the innovative MIL-SOM method and once to the actual untrained data. This two-step approach of training the data prior to clustering provides a solid foundation for knowledge discovery and data mining that clustering methods alone do not offer. The method has two benefits: it produces clusters similar to those of the original k-means method at a much faster rate, as shown by runtime comparison data, and it provides efficient analysis of large geospatial data, with implications for disease mechanism discovery. From a disease mechanism discovery perspective, it is hypothesized that the linear-like pattern of elevated blood lead levels discovered in the city of Chicago may be spatially linked to the city's water service lines.

Highlights

Clustering partitions the objects in a dataset into homogeneous groups of objects with similar qualities [1]
The pseudo code for this hybrid approach primarily comprises the k-d tree data structure that enhances the nearest neighbor query, the original k-means algorithm, and an adaptation rate proposed by Mashor
The FES-k-means algorithm uses a hybrid approach that comprises the k-d tree data structure [21], nearest neighbor query for the k-d tree [35], the original k-means algorithm [3], and an adaptation rate proposed by Mashor [18]

Summary

Introduction

Clustering partitions the objects in a dataset into homogeneous groups of objects with similar qualities [1]. It allows for the discovery of similarities and differences among patterns in order to derive useful conclusions about them [2]. Determining the structure or patterns within data is a significant component of classification and visualization, and it enables geospatial mining of high-volume datasets. The primary function of the k-means algorithm is to partition data into k disjoint subgroups, and the quality of these clusters is measured via different validation methods. The original k-means method is known to be weak in three major areas: (1) it is computationally expensive for large-scale datasets; (2) it requires clusters to be initialized a priori; and (3) its search is prone to local minima [4, 5].
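To make the idea of partitioning data into k disjoint subgroups concrete, the following is a minimal sketch of a k-means loop whose assignment step uses a k-d tree nearest neighbor query. It is an illustration under stated assumptions, not the FES-k-means algorithm itself: the function names (build_kdtree, nearest, kmeans_kdtree) are hypothetical, the tree is assumed to be built over the current centroids, initial centroids are drawn at random from the data, and the adaptation rate proposed by Mashor is omitted because its form is not given here.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(items, depth=0):
    """Build a k-d tree from (point, index) pairs; returns nested tuples or None."""
    if not items:
        return None
    axis = depth % len(items[0][0])          # cycle through the coordinates
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2                    # median item becomes the node
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return (squared_distance, index) of the stored point nearest to target."""
    if node is None:
        return best
    (point, idx), axis, left, right = node
    d = sq_dist(point, target)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < best[0]:
        best = nearest(far, target, best)
    return best

def kmeans_kdtree(data, k, iterations=100, seed=0):
    """Plain k-means; assumption: the k-d tree indexes the current centroids."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(data, k)]  # random initialization
    labels = None
    for _ in range(iterations):
        tree = build_kdtree([(c, i) for i, c in enumerate(centroids)])
        new_labels = [nearest(tree, p)[1] for p in data]
        if new_labels == labels:             # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:                      # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Calling kmeans_kdtree(points, 3) on a list of coordinate tuples returns the centroids and one cluster label per point. Because the initial centroids are random, the result can still settle in a local minimum, which is the third weakness listed above.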

Conclusion