OUTLIERS DETECTION WITH CORRELATED SUBSPACES FOR HIGH DIMENSIONAL DATASETS

JINSONG LENG,ZHIHU HUANG

doi:10.1142/s0219691311004067

Abstract

Detecting outliers in high dimensional datasets is quite a difficult data mining task. Mining outliers in subspaces seems to be a promising solution, because outliers may be embedded in some interesting subspaces. Due to the existence of many irrelevant dimensions in high dimensional datasets, it is of great importance to eliminate the irrelevant or unimportant dimensions and identify outliers in interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques and subspace-based clustering methods. The dimension-growth subspace clustering techniques find interesting subspaces in relatively lower possible dimension space, while dimension-growth approaches intend to find the maximum cliques in high dimensional datasets. This paper presents a novel approach by identifying outliers in correlated subspaces. The degree of correlation among dimensions is measured in terms of the mean squared residue. In doing so, we employ the frequent pattern algorithms to find the correlated subspaces. Based on the correlated subspaces obtained, outliers are distinguished from the projected subspaces by using classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high dimensional datasets.

Full Text