Abstract

Data reduction has become increasingly important as a preprocessing step for instance-based machine learning algorithms in the Big Data era. Its goal is to shrink a data set while retaining its most representative instances. Existing algorithms, however, suffer from heavy computational cost and from a trade-off between size reduction rate and learning accuracy. In this paper, we propose a fast data reduction approach that uses granular computing to label important instances, i.e., instances that contribute more to the learning task. The original data set is first granulated into $K$ granules by applying $K$-means in a mapped lower-dimensional space. The importance of each instance within every granule is then labeled based on its Hausdorff distance, and instances whose importance values fall below an experimentally tuned threshold are eliminated. The proposed algorithm is applied to $k$NN classification tasks on eighteen data sets of different sizes from the UCI repository, and its strong performance in classification accuracy, size reduction rate, and runtime is demonstrated by comparison with seven data reduction methods. The experimental results show that the proposed algorithm greatly reduces computational cost and achieves higher classification accuracy when the reduction size is the same for all compared algorithms.
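The pipeline described above (granulate with $K$-means, score each instance within its granule, drop low-importance instances) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the initial dimension-reduction mapping is omitted, and the distance-to-centroid score used as an importance proxy is a stand-in for the paper's Hausdorff-distance-based labeling; the function names and the `threshold` parameter are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means used to form the K granules."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers

def reduce_dataset(X, k=3, threshold=0.5):
    """Keep only instances whose importance meets the threshold.

    Importance proxy (assumption): each instance's distance from its
    granule centre, normalised to [0, 1] within the granule, on the
    intuition that boundary instances matter more for kNN decisions.
    """
    labels, centers = kmeans(X, k)
    keep = np.zeros(len(X), dtype=bool)
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        d = np.linalg.norm(X[idx] - centers[j], axis=1)
        imp = d / d.max() if d.max() > 0 else np.ones(len(d))
        # Eliminate instances below the (experimentally tuned) threshold.
        keep[idx[imp >= threshold]] = True
    return X[keep], keep
```

A reduced set produced this way can then be fed to any off-the-shelf $k$NN classifier in place of the full training set.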
