Abstract

We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions: Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed in other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
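
To make the meta-analysis idea concrete, here is a minimal sketch, assuming Fisher's method (one classic technique for combining p-values; the paper may use other combination rules as well), of how p-values of conditional independence tests computed locally on row-partitions could be merged into a single global p-value per candidate feature. The function name and example values are illustrative, not from the paper.

import numpy as np
from scipy import stats

def fisher_combine(p_values):
    # Fisher's method: -2 * sum(log(p_i)) follows a chi-squared distribution
    # with 2k degrees of freedom under the null hypothesis that the feature is
    # conditionally independent of the target in all k partitions.
    p = np.clip(np.asarray(p_values, dtype=float), 1e-300, 1.0)  # guard against log(0)
    statistic = -2.0 * np.log(p).sum()
    return stats.chi2.sf(statistic, df=2 * len(p))  # combined p-value

# Example: local p-values for one candidate feature, one per row-partition
print(fisher_combine([0.04, 0.10, 0.02, 0.07]))

Only the scalar p-values cross partition boundaries here, which is what keeps communication costs low when the test statistics themselves are computed locally.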

Highlights

  • Creating predictive models from data requires sophisticated machine learning, pattern recognition, and statistical modeling techniques

  • Although PFBP performs worse in running time than information-theoretic FS variants specialized for discrete and sparse data with available map-reduce implementations, it is still applicable and practical for large datasets

  • As a side product of the experiments, we compared two logistic regression algorithms: SparkLR, available in MLlib, which fits a global logistic regression model in a parallelized fashion, and CombLR, which combines the coefficients of local logistic regression models (a sketch of this idea follows the highlights)
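
Below is a minimal sketch of the coefficient-combination idea behind CombLR, assuming a sample-size-weighted average of logistic regression models fit independently on each row-partition with scikit-learn; the exact combination rule and implementation in the paper may differ, and all names here are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def comb_lr(partitions):
    # partitions: iterable of (X, y) blocks, one per worker / row-partition.
    coefs, intercepts, sizes = [], [], []
    for X, y in partitions:
        model = LogisticRegression(max_iter=1000).fit(X, y)  # local fit
        coefs.append(model.coef_.ravel())
        intercepts.append(model.intercept_[0])
        sizes.append(len(y))
    w = np.asarray(sizes, dtype=float) / sum(sizes)  # sample-size weights
    return w @ np.vstack(coefs), float(w @ np.asarray(intercepts))

def predict_proba(X, coef, intercept):
    # Apply the combined model exactly like a single global logistic regression.
    return 1.0 / (1.0 + np.exp(-(X @ coef + intercept)))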

Summary

Introduction

Creating predictive models from data requires sophisticated machine learning, pattern recognition, and statistical modeling techniques. By removing irrelevant as well as redundant (related to the concept of weakly relevant) features (John et al., 1994), FS essentially facilitates the learning task. It results in predictive models with fewer features that are easier to inspect, visualize, and understand, and faster to apply. In each forward Iteration, the Forward–Backward Selection (FBS) algorithm selects the feature that provides the largest increase in predictive performance for the target T and adds it to the set of selected variables, denoted with S hereon, starting from the empty set; a compact sketch of this loop is given below. We use the term Phase to refer to the forward and backward phases of the algorithm.
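
As announced above, here is a compact sketch of the generic FBS loop, using p-value-based tests against a significance threshold alpha. The pvalue(F, S) argument is a hypothetical placeholder for whatever conditional independence test of feature F with the target T given S is chosen; this plain sequential version deliberately omits PFBP's parallelization and its Early Dropping, Early Stopping, and Early Return heuristics.

def forward_backward_selection(features, pvalue, alpha=0.05):
    S = []  # selected features, starting from the empty set
    while True:  # Forward Phase: one feature added per Iteration
        candidates = [F for F in features if F not in S]
        if not candidates:
            break
        best = min(candidates, key=lambda F: pvalue(F, S))
        if pvalue(best, S) >= alpha:  # no remaining feature is significant
            break
        S.append(best)
    changed = True
    while changed:  # Backward Phase: drop features made redundant by later picks
        changed = False
        for F in list(S):
            if pvalue(F, [G for G in S if G != F]) >= alpha:
                S.remove(F)
                changed = True
    return S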

