Attribute reduction is widely employed to improve the efficiency and accuracy of data analysis by eliminating redundant and irrelevant attributes from datasets. However, as datasets grow to big-data scale, sequential execution of such algorithms becomes time-consuming, and scalable parallelization requires distributed computing. This study proposes a novel attribute reduction algorithm for neighborhood decision systems. We introduce two new metrics, the neighborhood evidential conflict degree (NECD) and the neighborhood evidential conflict rate (NECR), which quantify the heterogeneity between samples within a neighborhood and the significance of attributes in the feature space, respectively. These metrics guide the evaluation and selection of attribute subsets during reduction, improving both classification accuracy and computational efficiency. We also develop a sequential forward selection attribute reduction method that selects a feature subset guided by the defined NECR. Finally, we implement a distributed attribute reduction algorithm on Apache Spark. Our approach uses a two-phase MapReduce process for K-nearest-neighbor search, evidence combination, and NECR computation. As a measure of feature-subset quality, the NECR strengthens the subset's ability to approximate the decision structure of the data. Experimental results on small and large datasets demonstrate that the proposed algorithm outperforms benchmark algorithms in both classification accuracy and computational efficiency.
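To make the NECR-guided reduction step concrete, the following is a minimal sketch of a sequential forward selection loop. It assumes a user-supplied `necr(subset, X, y)` function that returns the neighborhood evidential conflict rate of a candidate attribute subset (lower is better) and a stopping threshold `epsilon`; these names, the stopping rule, and the signature are illustrative assumptions, not the paper's exact formulation, and the distributed KNN search and evidence combination performed in Spark are abstracted away.

```python
# Illustrative sketch of NECR-guided sequential forward selection.
# Assumptions (not taken from the abstract): necr(subset, X, y) computes the
# neighborhood evidential conflict rate of `subset` on data (X, y), with lower
# values indicating a better subset; `epsilon` is a stopping threshold.

def forward_selection(all_attributes, X, y, necr, epsilon=1e-4):
    """Greedily add the attribute that most reduces NECR until no
    candidate improves the rate by more than `epsilon`."""
    selected = []
    remaining = list(all_attributes)
    current_rate = float("inf")  # no subset selected yet

    while remaining:
        # Score each candidate subset selected + [a] by its NECR.
        scored = [(necr(selected + [a], X, y), a) for a in remaining]
        best_rate, best_attr = min(scored)

        # Stop when the best candidate no longer improves NECR meaningfully.
        if current_rate - best_rate <= epsilon:
            break

        selected.append(best_attr)
        remaining.remove(best_attr)
        current_rate = best_rate

    return selected
```

In the distributed setting described above, each `necr` evaluation would itself be computed via the two-phase MapReduce pipeline (KNN search, evidence combination, NECR aggregation) rather than on a single machine.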