A Distributed Attribute Reduction Algorithm for High-Dimensional Data under the Spark Framework

Zhengjiang Wu,Junwei Luo,Yaning Zhang,Qiuyu Mei,Tian Yang

doi:10.1007/s44196-022-00076-7

Zhengjiang Wu, Junwei Luo + Show 3 more

Open Access

https://doi.org/10.1007/s44196-022-00076-7

Copy DOI

Abstract

Attribute reduction is an important issue in rough set theory. However, the rough set theory-based attribute reduction algorithms need to be improved to deal with high-dimensional data. A distributed version of the attribute reduction algorithm is necessary to enable it to effectively handle big data. The partition of attribute space is an important research direction. In this paper, a distributed attribution reduction algorithm based on cosine similarity (DARCS) for high-dimensional data pre-processing under the Spark framework is proposed. First, to avoid the repeated calculation of similar attributes, the algorithm gathers similar attributes based on similarity measure to form multiple clusters. And then one attribute is selected randomly as a representative from each cluster to form a candidate attribute subset to participate in the subsequent reduction operation. At the same time, to improve computing efficiency, an improved method is introduced to calculate the attribute dependency in the divided sub-attribute space. Experiments on eight datasets show that, on the premise of avoiding critical information loss, the reduction ability and computing efficiency of DARCS have been improved by 0.32 to 39.61% and 31.32 to 93.79% respectively compared to the distributed version of attribute reduction algorithm based on a random partitioning of the attributes space.

Full Text