RHDOFS: A Distributed Online Algorithm Towards Scalable Streaming Feature Selection

Chuan Luo, Sizhao Wang, Jiancheng Lv, Hongmei Chen, Zhang Yi, Tianrui Li

doi:10.1109/tpds.2023.3265974

Abstract

Feature selection is an important topic in data mining and machine learning that aims to select an optimal feature subset for building effective and explainable prediction models. This paper introduces the Rough Hypercuboid based Distributed Online Feature Selection (RHDOFS) method to tackle two critical challenges associated with Big Data: Volume and Velocity. By exploring class separability in the boundary region of the rough hypercuboid approach, a novel integrated feature evaluation criterion is proposed that examines not only the explicit patterns contained in the positive region but also the useful implicit patterns derived from the boundary region. An efficient online feature selection method for the streaming-feature scenario is developed to identify relevant and nonredundant features in an incremental, iterative fashion. Furthermore, a parallel optimization mechanism that combines data independence and computational independence is employed to accelerate the original sequential implementation. The resulting distributed online feature selection algorithm is implemented on the Apache Spark platform and scales to massive amounts of data by exploiting the computational capabilities of multicore clusters. Extensive experiments indicate that the proposed algorithm outperforms relevant and representative online feature selection algorithms. Empirical tests on scalability and extensibility also demonstrate that the distributed implementation significantly reduces computation time while maintaining prediction accuracy, and that it scales well with both the volume of data and the number of computing nodes.
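
To make the streaming-feature setting concrete, the following is a minimal single-machine sketch, not the RHDOFS algorithm itself. It assumes the standard rough hypercuboid construction, in which each class spans a value interval on each feature and the positive region consists of the samples covered by exactly one class. It scores a candidate feature only by the gain in positive-region dependency, and leaves out the boundary-region term of the integrated criterion, the redundancy analysis, and the Spark parallelization. The function names (`hypercuboid_matrix`, `dependency`, `online_select`) are hypothetical.

```python
import numpy as np

def hypercuboid_matrix(x, y, classes):
    """Row i, column j is True when sample j lies inside the
    [min, max] interval that class i spans on feature x."""
    rows = []
    for c in classes:
        vals = x[y == c]
        rows.append((x >= vals.min()) & (x <= vals.max()))
    return np.array(rows)

def dependency(H):
    """Fraction of samples covered by exactly one class hypercuboid
    (the positive region); the rest fall in the boundary region."""
    return float(np.mean(H.sum(axis=0) == 1))

def online_select(feature_stream, y):
    """Process features one at a time; keep a feature only if it
    enlarges the positive region of the selected subset.
    Assumption: this simple acceptance rule stands in for the
    paper's integrated evaluation criterion."""
    classes = np.unique(y)
    selected = []
    # With no feature selected, every sample is covered by every class.
    H_sel = np.ones((len(classes), len(y)), dtype=bool)
    dep_sel = 0.0
    for name, x in feature_stream:
        # Hypercuboids of a feature subset: elementwise AND over features.
        H_new = H_sel & hypercuboid_matrix(np.asarray(x), y, classes)
        dep_new = dependency(H_new)
        if dep_new > dep_sel:
            selected.append(name)
            H_sel, dep_sel = H_new, dep_new
    return selected, dep_sel

y = np.array([0, 0, 0, 1, 1, 1])
stream = [
    ("f1", np.array([1, 2, 3, 4, 5, 6])),       # separates the classes
    ("f1_copy", np.array([1, 2, 3, 4, 5, 6])),  # adds nothing new
    ("noise", np.array([1, 5, 3, 2, 6, 4])),    # class intervals overlap
]
selected, dep = online_select(stream, y)
print(selected, dep)
```

In this toy stream only `f1` is kept: the duplicate and the noisy feature arrive after the positive region already covers every sample, so neither can increase the dependency.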

Full Text