Abstract
Clustering is a technique for partitioning objects into groups such that the objects in each group are similar to one another and distinct from those in other groups. One of the most popular clustering techniques is the k-means algorithm. Conventional k-means may not work well on high-dimensional datasets because of the noise, inconsistencies, and outliers in the original data. Some form of transformation is therefore required to prepare the data for clustering. Four data pre-processing methods are applied before the clustering algorithm to make the data clean, noise-free, and consistent. The impact of data pre-processing on the basic k-means algorithm was tested on real-life data using four normalization techniques: z-score, min-max, decimal scaling, and mean absolute deviation. We find that pre-processing before clustering yields good clustering results and significantly reduces the running time compared with traditional techniques. We also conclude that mean absolute deviation is the best of the four normalization methods, as it captures all clustering points.
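As an illustration of the four normalization techniques named above, the following minimal sketch applies each one to a single numeric feature column. It uses common textbook definitions, which are an assumption here: the abstract does not state the exact formulas used in the study. The function names are hypothetical, and constant columns (zero spread) are not handled.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def mean_abs_deviation(values):
    """Centre on the mean and divide by the mean absolute deviation."""
    mu = fmean(values)
    mad = fmean([abs(v - mu) for v in values])
    return [(v - mu) / mad for v in values]


if __name__ == "__main__":
    column = [2.0, 4.0, 6.0, 8.0]  # made-up example values, not data from the study
    for normalize in (z_score, min_max, decimal_scaling, mean_abs_deviation):
        print(normalize.__name__, normalize(column))
```

In a pipeline of the kind the abstract describes, each feature column would be normalized in this way before the data is passed to k-means, so that no single feature dominates the distance computation because of its scale.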