Abstract

Attribute reduction is an important data preprocessing technique in data mining that has attracted considerable attention for its ability to improve the generalization performance and learning speed of analysis models. Rough set theory offers a systematic and powerful framework for attribute reduction in terms of classification and decision abilities under uncertainty. In this paper, we present a parallel neighborhood entropy-based attribute reduction method with neighborhood rough sets that uses the Apache Spark cluster-computing model to parallelize the algorithm in a distributed computing environment. Leveraging a horizontal partitioning strategy for data parallelism, three quantitative measures of attribute sets, i.e., the neighborhood approximation accuracy and the neighborhood credibility and coverage degrees, are parallelized to accelerate the computation of the decision neighborhood entropy during the heuristic search process. A novel parallel heuristic attribute reduction algorithm is then developed, employing several operations from the Spark API to simplify code parallelization. Extensive experimental results demonstrate the strengths of the proposed algorithm with respect to standard parallel-performance criteria, i.e., scalability and extensibility.
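To make the parallelization idea concrete, the sketch below shows how one of the abstract's quantitative measures, the neighborhood approximation accuracy, might be computed over a horizontally partitioned decision table with Spark. This is a minimal illustration, not the paper's algorithm: the toy data, the neighborhood radius `delta`, the broadcast of the full table (workable only when it fits in executor memory), and all identifiers are assumptions made for the example. It uses the standard rough-set definition of accuracy as the ratio of the total lower-approximation size to the total upper-approximation size over the decision classes.

```scala
import org.apache.spark.sql.SparkSession

object NeighborhoodAccuracySketch {
  // A sample: numeric condition attributes plus a decision label.
  case class Sample(cond: Array[Double], decision: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NeighborhoodApproximationAccuracy")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy decision table; in practice this would be loaded from HDFS.
    val data = Seq(
      Sample(Array(0.10, 0.20), 0), Sample(Array(0.15, 0.22), 0),
      Sample(Array(0.80, 0.90), 1), Sample(Array(0.82, 0.88), 1),
      Sample(Array(0.50, 0.50), 0), Sample(Array(0.52, 0.48), 1)
    )
    val delta = 0.2          // neighborhood radius (assumed parameter)
    val attrs = Array(0, 1)  // candidate attribute subset B

    // Horizontal partitioning: rows are distributed across the cluster.
    val rdd = sc.parallelize(data, numSlices = 4)

    // Broadcast the full table so each partition can form the
    // delta-neighborhood of its rows locally.
    val all = sc.broadcast(data.toArray)

    // Euclidean distance restricted to the attribute subset.
    def dist(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(attrs.map(i => { val d = a(i) - b(i); d * d }).sum)

    // Each row contributes 1 to the lower approximation of its class iff
    // its neighborhood is decision-pure, and 1 to the upper approximation
    // of every decision class its neighborhood touches.
    val (lower, upper) = rdd.map { x =>
      val nbrs = all.value.filter(y => dist(x.cond, y.cond) <= delta)
      val decisions = nbrs.map(_.decision).distinct
      val lowerContrib = if (decisions.length == 1) 1L else 0L
      (lowerContrib, decisions.length.toLong)
    }.reduce { case ((l1, u1), (l2, u2)) => (l1 + l2, u1 + u2) }

    // Neighborhood approximation accuracy = |lower| / |upper|.
    println(s"approximation accuracy = ${lower.toDouble / upper}")
    spark.stop()
  }
}
```

The map-then-reduce structure mirrors the abstract's strategy: each partition evaluates its rows independently, and only two counters are shuffled back to the driver, which is what makes the measure cheap to recompute inside a heuristic search loop. The other measures mentioned in the abstract (neighborhood credibility and coverage degrees) would presumably follow the same per-partition aggregation pattern.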
