Large dataset summarization with automatic parameter optimization and parallel processing for outlier detection

Zhaoyu Shou, Simin Li

doi:10.1109/fskd.2017.8393136

Abstract

As one of the most important research problems in data analytics and data mining, outlier detection in large datasets has drawn considerable research attention in recent years. In this paper, we investigate the problem of summarizing large datasets to support efficient local outlier detection. We propose efficient summarization algorithms that produce a highly compact summary of the original dataset, which can then be applied to detect local outliers in future similar datasets. A novel automatic parameter optimization method is proposed to determine the optimal settings of the key parameters used in the summarization algorithm. Parallel processing methods are also proposed to significantly accelerate the summarization process. The experimental results demonstrate that the proposed algorithms are highly scalable for large datasets and effective in producing dataset summaries for local outlier detection.

Full Text