Abstract

Attribute reduction is an important data preprocessing technique in data mining that has attracted much attention for its ability to improve the generalization performance and learning speed of analysis models. Rough set theory offers a systematic and powerful framework for attribute reduction in terms of classification and decision abilities under uncertainty. In this paper, we present a parallel neighborhood entropy-based attribute reduction method built on neighborhood rough sets, which uses the Apache Spark cluster computing model to parallelize the algorithm in a distributed computing environment. Leveraging a horizontal partitioning strategy to enable data parallelism, three quantitative measures of attribute sets, i.e., neighborhood approximation accuracy, neighborhood credibility degree, and neighborhood coverage degree, are parallelized to accelerate the computation of decision neighborhood entropy during the heuristic search iterations. A novel parallel heuristic attribute reduction algorithm is then developed using several operations from the Spark API to simplify code parallelization. Extensive experimental results demonstrate the superiority of the proposed algorithm with respect to the standard criteria for evaluating parallel performance, namely scalability and extensibility.
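The horizontal partitioning idea described above can be illustrated with a minimal sketch: sample indices are split into partitions, each partition computes partial lower/upper approximation counts for the neighborhood approximation accuracy, and the partials are reduced into a global value. This is a hedged, pure-Python mimic of what a Spark job would do with `sc.parallelize(...).mapPartitions(...).reduce(...)`; the function names (`neighborhood`, `partition_counts`, `approximation_accuracy`), the distance-threshold neighborhood, and the single-measure focus are illustrative assumptions, not the paper's actual implementation.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# data: list of (feature_vector, decision_label) pairs

def neighborhood(i, data, delta):
    # Indices of samples within distance delta of sample i (itself included);
    # this is the standard delta-neighborhood of neighborhood rough sets.
    xi = data[i][0]
    return [j for j, (xj, _) in enumerate(data) if dist(xi, xj) <= delta]

def partition_counts(idx_chunk, data, delta):
    # Per-partition partial counts, mirroring what one Spark mapPartitions
    # task would emit: (sum of lower-approximation sizes,
    #                   sum of upper-approximation sizes) over its samples.
    lower = upper = 0
    for i in idx_chunk:
        labels = {data[j][1] for j in neighborhood(i, data, delta)}
        if len(labels) == 1:      # pure neighborhood -> lower approximation
            lower += 1
        upper += len(labels)      # i lies in the upper approx of each label
    return lower, upper

def approximation_accuracy(data, delta, n_parts=2):
    # Horizontally partition the sample indices, compute partial counts on
    # each partition, then reduce -- the driver-side aggregation step.
    n = len(data)
    chunks = [range(p, n, n_parts) for p in range(n_parts)]
    partials = [partition_counts(c, data, delta) for c in chunks]
    lower = sum(l for l, _ in partials)
    upper = sum(u for _, u in partials)
    return lower / upper if upper else 0.0
```

Because each partition only needs read access to the full (broadcast) dataset and emits a small pair of counts, the reduce step is cheap, which is what makes this measure amenable to Spark-style data parallelism.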
