Abstract

Clustering is an important task in a wide variety of application domains, especially in management and social science research. In this paper, an iterative clustering procedure based on multivariate outlier detection is proposed using the well-known Mahalanobis distance. First, the Mahalanobis distance is calculated for the entire sample; then an upper control limit (UCL) is fixed using the T²-statistic. Observations above the UCL are treated as outliers and grouped into an outlier cluster, and the same procedure is repeated on the remaining inliers until the variance-covariance matrix of the variables in the last cluster becomes singular. At each iteration, a multivariate test of means is used to check the discrimination between the outlier clusters and the inliers. Multivariate control charts are also used to graphically visualize the iterations and the outlier clustering process. Finally, the multivariate test of means helps to firmly establish cluster discrimination and validity. The procedure is applied to cluster 275 customers of a well-known two-wheeler brand in India based on 19 attributes of the two-wheeler and its company. The results confirm that there are 5 and 7 outlier clusters of customers in the sample at the 5% and 1% significance levels, respectively.
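The iterative peeling procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Phase-I beta-distribution form of the Hotelling T² UCL and the exact stopping rules (no remaining outliers, near-singular covariance, or too few observations) are assumptions based on standard multivariate control-chart practice.

```python
import numpy as np
from scipy import stats

def outlier_clusters(X, alpha=0.05):
    """Iteratively peel off outlier clusters using squared Mahalanobis
    distance (equivalently the T^2 statistic) against a Phase-I UCL."""
    X = np.asarray(X, dtype=float)
    idx = np.arange(len(X))          # indices of the current inliers
    clusters = []                    # one array of indices per outlier cluster
    p = X.shape[1]
    while True:
        n = len(idx)
        if n <= p + 1:               # too few points to estimate the UCL
            break
        sub = X[idx]
        mean = sub.mean(axis=0)
        cov = np.cov(sub, rowvar=False)
        if np.linalg.matrix_rank(cov) < p:   # covariance became singular: stop
            break
        inv = np.linalg.inv(cov)
        # squared Mahalanobis distance of each row from the sample mean
        d2 = np.einsum('ij,jk,ik->i', sub - mean, inv, sub - mean)
        # Phase-I T^2 UCL for individual observations (beta-distribution form)
        ucl = (n - 1) ** 2 / n * stats.beta.ppf(1 - alpha, p / 2, (n - p - 1) / 2)
        out = d2 > ucl
        if not out.any():            # no points above the UCL: stop
            break
        clusters.append(idx[out])    # points above the UCL form an outlier cluster
        idx = idx[~out]              # repeat on the remaining inliers
    return clusters, idx
```

Each returned cluster holds the row indices flagged at one iteration, so the clusters together with the final inlier set partition the sample, mirroring the repeated T²-chart screening the paper describes.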

Highlights

  • Outliers are the set of objects that are considerably dissimilar from the remainder of the data (Han, 2006)

  • Different approaches have been proposed to detect outliers, and a good survey can be found in (Knorr, 1998; Knorr, 2000; Hodge, 2004)

  • The case is different for the Partitioning Around Medoids (PAM) algorithm (Kaufman and Rousseeuw, 1990)



Introduction

Outliers are the set of objects that are considerably dissimilar from the remainder of the data (Han, 2006). Several clustering-based outlier detection techniques have been developed. Most of these techniques rely on the key assumption that normal objects belong to large and dense clusters, while outliers form very small clusters (Loureiro, 2004; Niu, 2007). Many researchers have debated whether clustering algorithms are an appropriate choice for outlier detection. PAM is more robust than the k-means algorithm in the presence of noise and outliers, because the medoids produced by PAM are robust representations of the cluster centers and are less influenced by outliers and other extreme values than the means (Laan, 2003; Kaufman and Rousseeuw, 1990; Dudoit and Fridlyand, 2002). Note that our approach can also be implemented with other clustering algorithms that are based on PAM, such as CLARA (Kaufman and Rousseeuw, 1990), CLARANS (Ng and Han, 1994) and CLATIN (Zhang and Couloigner, 2005).

