In this paper, we propose a scalable mining algorithm that discovers contextual outliers using relevant subspaces. The algorithm is developed under the MapReduce programming model and runs on a Hadoop cluster. Relevant subspaces, which effectively capture the local distribution of the data, are identified using the local sparseness of attribute dimensions. We design a novel way of computing local outlier factors in a relevant subspace from the probability density of the local dataset; this approach effectively reflects the outlier degree of a data object that does not conform to the distribution of the local dataset in the relevant subspace. The attribute dimensions of a relevant subspace and the local outlier factors serve as vital contextual information, which improves the interpretability of outliers. Finally, the $N$ data objects with the largest local outlier factor values are reported as contextual outliers. The mining algorithm, which incorporates a locality-sensitive hashing distribution strategy, is implemented on a Hadoop cluster. Experimental results on both synthetic data and stellar spectral data validate the effectiveness, interpretability, scalability, and extensibility of the algorithm.
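The final selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the dict-of-scores interface are hypothetical, and the local outlier factor values here are made-up placeholders (in the paper they would be computed per relevant subspace from local probability densities).

```python
import heapq

def top_n_contextual_outliers(lof_scores, n):
    """Return the n objects with the largest local outlier factor.

    lof_scores: dict mapping object id -> local outlier factor
    (hypothetical interface; real scores come from the subspace-based
    density computation described in the abstract).
    """
    return heapq.nlargest(n, lof_scores.items(), key=lambda kv: kv[1])

# Placeholder scores for four data objects.
scores = {"a": 1.2, "b": 5.7, "c": 0.9, "d": 3.4}
print(top_n_contextual_outliers(scores, 2))  # [('b', 5.7), ('d', 3.4)]
```

Using a heap-based selection keeps the step at O(m log n) for m objects, which matters when the candidate set produced by the distributed phase is large.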