Abstract

Learning a classification model from a dataset that contains irrelevant variables is more difficult and yields less accurate results. Feature selection has been proposed to address this issue, but existing feature selection methods struggle to select accurate and relevant feature subsets or subspaces. The literature has shown that learning the Markov boundary (MB) of a target variable helps extract appropriate subsets of relevant features from the full data space. Because almost all existing methods learn MBs from static data, a state-of-the-art algorithm for learning MBs from streaming data, called SDMB (streaming data-based MB), was recently proposed. SDMB successfully improved online, MB-based feature selection from data streams for classification tasks. Nevertheless, we found that SDMB contains computational redundancy and leaves room for improvement in runtime; we therefore propose an improvement to it. The main drawback of SDMB is that, in each iteration of MB learning over the full data space, it entirely recomputes the conditional independence/dependence tests (i.e., the dependency scores) between a target variable and the other variables. This paper proposes an incremental variant of SDMB, called FastSDMB, that addresses this bottleneck by reducing SDMB's dependency-score computation time. FastSDMB avoids the computational redundancy through a novel incremental collective dependency-scoring process, which retains and reuses, in the current iteration, the conditional independence test scores of variables that have not changed since previous iterations.
To demonstrate FastSDMB's effectiveness and its ability to handle fast data-arrival scenarios, we first compare it against the state-of-the-art SDMB, and then apply FastSDMB to various classification models to showcase the resulting performance improvement on a wide variety of synthetic and real-world datasets. The experimental results show that FastSDMB runs up to 4.8× faster on average than SDMB while maintaining the same accuracy when learning Markov boundaries, proving its efficiency for fast, MB-based relevant feature selection from data streams.
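The core idea of the incremental process described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names below (e.g. `DependencyScoreCache`, `score_fn`) are our own, and the sketch assumes only that each iteration knows which variables changed since the last one.

```python
# Minimal sketch of incremental dependency-score caching (assumed names,
# not from the FastSDMB paper). Scores for variables that did not change
# between iterations are reused instead of recomputed.

class DependencyScoreCache:
    """Caches dependency scores between a target and each candidate
    variable, recomputing only entries whose variable has changed."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # e.g. a conditional independence test
        self.cache = {}           # variable name -> cached score

    def scores(self, target, variables, changed):
        # Recompute only variables flagged as changed (or never seen);
        # reuse the cached score for everything else.
        for v in variables:
            if v in changed or v not in self.cache:
                self.cache[v] = self.score_fn(target, v)
        return {v: self.cache[v] for v in variables}
```

In a streaming setting, each newly arrived data block would flag the variables whose statistics it updated; all other variables keep their previously computed test scores, which is where the runtime saving over full recomputation comes from.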

