Abstract
In high-dimensional spaces with large amounts of data, distances between data points tend to become relatively uniform. The notion of a data point's nearest neighbours thus becomes meaningless, a phenomenon known as the curse of dimensionality. Identifying outliers (data points whose statistical characteristics differ significantly from those of the majority of the data) in such a high-dimensional space can be a significant challenge. Mining for outliers in subspaces of relevant attributes is one approach to this problem, and identifying these attributes is the main objective of this work. In this paper, we scale a grid-based solution that searches for subspaces, each defined by a subset of features, that are candidates for outlier detection. We specify a population and a fitness function for a distributed genetic algorithm that heuristically searches the subspaces within the high-dimensional data and finds the subspace with maximal sparsity. We designed and implemented our proposed subspace selection algorithm in Apache Spark, a fast in-memory engine for large-scale data processing. Initial experimental results on a large dataset (77,000 records and 1,379 attributes) confirm that our proposed method can identify the most relevant subspaces for outlier detection.
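To make the notion of "sparsity" in a grid-based subspace concrete, the following is a minimal single-machine sketch, not the Spark implementation described above. The abstract does not state the fitness function, so the sketch assumes a sparsity coefficient of the kind commonly used in grid-based outlier detection (as in Aggarwal and Yu's method): each attribute is split into `phi` equi-depth ranges, and a cell in a `k`-dimensional subspace is scored by how far its point count falls below the count expected under independence. The function names and the parameter `phi` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def equi_depth_bins(values, phi):
    """Assign each value to one of phi equi-depth ranges, based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * phi // len(values)
    return bins


def sparsity_coefficient(count, n, k, phi):
    """Standardised deviation of a cell count from its expected value.

    Assumption: with f = 1/phi, a k-dimensional cell is expected to hold
    n * f**k points if the attributes were independent. Strongly negative
    values indicate an unusually sparse cell.
    """
    p = (1.0 / phi) ** k
    return (count - n * p) / math.sqrt(n * p * (1.0 - p))


def sparsest_cell(binned, dims, phi):
    """Return (coefficient, cell) for the least populated non-empty cell
    in the subspace spanned by the attribute indices in dims.

    binned[d][i] is the bin index of record i on attribute d.
    """
    n = len(binned[0])
    counts = Counter(tuple(binned[d][i] for d in dims) for i in range(n))
    cell, count = min(counts.items(), key=lambda kv: kv[1])
    return sparsity_coefficient(count, n, len(dims), phi), cell
```

Under these assumptions, a genetic algorithm could encode each individual as a set of attribute indices (`dims`) and use the coefficient returned by `sparsest_cell` as its fitness, preferring more negative values. How the population is encoded and distributed in the actual system is not specified in the abstract.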