Abstract

We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions: Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed in other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
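
To make the meta-analysis idea concrete, here is a minimal sketch, assuming Fisher's method (one classic technique for combining p-values; the paper may use other combination rules as well), of how p-values of conditional independence tests computed locally on row-partitions could be merged into a single global p-value per candidate feature. The function name and example values are illustrative, not from the paper.

import numpy as np
from scipy import stats

def fisher_combine(p_values):
    # Fisher's method: -2 * sum(log(p_i)) follows a chi-squared distribution
    # with 2k degrees of freedom under the null hypothesis that the feature is
    # conditionally independent of the target in all k partitions.
    p = np.clip(np.asarray(p_values, dtype=float), 1e-300, 1.0)  # guard against log(0)
    statistic = -2.0 * np.log(p).sum()
    return stats.chi2.sf(statistic, df=2 * len(p))  # combined p-value

# Example: local p-values for one candidate feature, one per row-partition
print(fisher_combine([0.04, 0.10, 0.02, 0.07]))

Only the scalar p-values cross partition boundaries here, which is what keeps communication costs low when the test statistics themselves are computed locally.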

Highlights

  • Creating predictive models from data requires sophisticated machine learning, pattern recognition, and statistical modeling techniques

  • Although PFBP performs worse in running time than information-theoretic FS variants specialized for discrete and sparse data with available map-reduce implementations, it is still applicable and practical for large datasets

  • As a side product of the experiments, we compared two logistic regression algorithms: SparkLR, available in MLlib, which fits a global logistic regression model in a parallelized fashion, and CombLR, which combines the coefficients of local logistic regression models (a sketch of this idea follows the highlights)
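
Below is a minimal sketch of the coefficient-combination idea behind CombLR, assuming a sample-size-weighted average of logistic regression models fit independently on each row-partition with scikit-learn; the exact combination rule and implementation in the paper may differ, and all names here are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def comb_lr(partitions):
    # partitions: iterable of (X, y) blocks, one per worker / row-partition.
    coefs, intercepts, sizes = [], [], []
    for X, y in partitions:
        model = LogisticRegression(max_iter=1000).fit(X, y)  # local fit
        coefs.append(model.coef_.ravel())
        intercepts.append(model.intercept_[0])
        sizes.append(len(y))
    w = np.asarray(sizes, dtype=float) / sum(sizes)  # sample-size weights
    return w @ np.vstack(coefs), float(w @ np.asarray(intercepts))

def predict_proba(X, coef, intercept):
    # Apply the combined model exactly like a single global logistic regression.
    return 1.0 / (1.0 + np.exp(-(X @ coef + intercept)))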

Summary

Introduction

Creating predictive models from data requires sophisticated machine learning, pattern recognition, and statistical modeling techniques. By removing irrelevant as well as redundant (related to the concept of weakly relevant) features (John et al., 1994), FS essentially facilitates the learning task. It results in predictive models with fewer features that are easier to inspect, visualize, and understand, and faster to apply. In each forward Iteration, the Forward–Backward Selection (FBS) algorithm selects the feature that provides the largest increase in predictive performance for the target T and adds it to the set of selected variables, denoted with S hereon, starting from the empty set; a compact sketch of this loop is given below. We use the term Phase to refer to the forward and backward phases of the algorithm.
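
As announced above, here is a compact sketch of the generic FBS loop, using p-value-based tests against a significance threshold alpha. The pvalue(F, S) argument is a hypothetical placeholder for whatever conditional independence test of feature F with the target T given S is chosen; this plain sequential version deliberately omits PFBP's parallelization and its Early Dropping, Early Stopping, and Early Return heuristics.

def forward_backward_selection(features, pvalue, alpha=0.05):
    S = []  # selected features, starting from the empty set
    while True:  # Forward Phase: one feature added per Iteration
        candidates = [F for F in features if F not in S]
        if not candidates:
            break
        best = min(candidates, key=lambda F: pvalue(F, S))
        if pvalue(best, S) >= alpha:  # no remaining feature is significant
            break
        S.append(best)
    changed = True
    while changed:  # Backward Phase: drop features made redundant by later picks
        changed = False
        for F in list(S):
            if pvalue(F, [G for G in S if G != F]) >= alpha:
                S.remove(F)
                changed = True
    return S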

