Subspace selection in high-dimensional big data using genetic algorithm in Apache Spark

Fatemeh Cheraghchi, Bijan Raahemi, Arash Iranzad

doi:10.1145/3018896.3018950

Publication Date: Mar 22, 2017

Citations: 1

Affiliation: University of Ottawa

Abstract

In high-dimensional space with large amounts of data, distances between data points tend to become relatively uniform. The notion of the nearest neighbours of a data point thus becomes meaningless, a phenomenon known as the curse of dimensionality. Identifying outliers (data points whose statistical characteristics differ significantly from those of the majority of the data) in such a high-dimensional space can be a significant challenge. Mining for outliers in subspaces of relevant attributes is one approach to this problem, and identifying these attributes is the main objective of this work. In this paper, we scale a grid-based solution to search for subspaces that are candidates for outlier detection with regard to the subset of features in the subspace. We specify a population and a fitness function for a distributed genetic algorithm that heuristically searches the subspaces within the high-dimensional data and finds the subspace with maximal sparsity. We designed and implemented our proposed subspace selection algorithm in Apache Spark, a fast in-memory engine for large-scale data processing. The initial experimental results on a large dataset (77,000 records and 1,379 attributes) confirm that our proposed method can identify the most relevant subspaces for outlier detection.
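The idea of a grid-based sparsity fitness searched by a genetic algorithm can be illustrated with a minimal single-machine sketch in plain Python. The abstract does not give the fitness function, the grid construction, or the genetic operators, so everything below is an assumption: the fitness is the sparsity coefficient commonly used in grid-based subspace outlier detection, (count - N p) / sqrt(N p (1 - p)) with p = (1/phi)^k for phi equi-depth bins and k attributes, scored over occupied cells only; the names discretize, sparsity, and search, and all parameter values, are hypothetical. It does not use Spark; in a distributed setting the per-cell counting step is the part that would be spread across workers.

```python
import math
import random
from collections import Counter


def discretize(data, phi):
    """Assign every value an equi-depth bin index in 0..phi-1, column by column."""
    n, d = len(data), len(data[0])
    grid = [[0] * d for _ in range(n)]
    for j in range(d):
        order = sorted(range(n), key=lambda i: data[i][j])
        for rank, i in enumerate(order):
            grid[i][j] = rank * phi // n
    return grid


def sparsity(grid, dims, phi):
    """Most negative sparsity coefficient among occupied cells of subspace dims.

    Assumed fitness: (count - N*p) / sqrt(N*p*(1-p)) with p = (1/phi)**k.
    Lower values mean a cell holds far fewer points than expected.
    """
    n, k = len(grid), len(dims)
    p = (1.0 / phi) ** k
    expected = n * p
    std = math.sqrt(n * p * (1.0 - p))
    counts = Counter(tuple(row[j] for j in dims) for row in grid)
    return min((c - expected) / std for c in counts.values())


def search(grid, phi, k, pop_size=20, generations=30, seed=0):
    """Toy genetic search for a k-attribute subspace with minimal sparsity value."""
    rng = random.Random(seed)
    d = len(grid[0])

    def fitness(dims):
        return sparsity(grid, dims, phi)

    # Each individual is a sorted tuple of k distinct attribute indices.
    pop = [tuple(sorted(rng.sample(range(d), k))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]  # keep the sparser half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw k attributes from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), k)
            # Mutation: occasionally swap one attribute for an unused one.
            if rng.random() < 0.2:
                free = [j for j in range(d) if j not in child]
                if free:
                    child[rng.randrange(k)] = rng.choice(free)
            children.append(tuple(sorted(child)))
        pop = parents + children
    return min(pop, key=fitness)
```

A call such as search(discretize(data, 5), 5, 3) would return a tuple of three attribute indices. Because the better half of each generation is carried over unchanged, the best fitness in the population never gets worse from one generation to the next.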

Full Text