Abstract

The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, such as the max-min parent children algorithm. The SES algorithm is implemented in a homonymous function included in the R package MXM, which stands for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of using the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.

Highlights

  • Feature selection is one of the fundamental tasks in the area of machine learning

  • The process of feature or variable selection aims to identify a subset of features that are relevant with respect to a given task; for example, in regression and classification it is often desirable to select and retain only the subset of variables with the highest predictive power

  • Depending on the dataset and the target specified by the user, the statistically equivalent signature (SES) function automatically selects the data analysis task to perform and the conditional independence test to use; for example, in a binary classification task the objective of the analysis is to find the model that best discriminates between the two classes
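The behavior described in the last highlight can be illustrated with a short, hedged R sketch: calling `SES` from the MXM package on a synthetic binary-classification problem, where the test is chosen automatically from the type of the target. The argument names (`max_k`, `threshold`) and the output slots (`selectedVars`, `signatures`) follow the MXM documentation at the time of writing; check `?SES` in your installed version before relying on them.

```r
# Hedged sketch of a SES call on synthetic data (assumes the MXM package
# is installed; exact arguments/slots may differ across MXM versions).
library(MXM)

set.seed(1)
n <- 100; p <- 20
dataset <- matrix(rnorm(n * p), nrow = n)                      # candidate features
target  <- rbinom(n, 1, plogis(dataset[, 1] - dataset[, 2]))   # binary target

# With a binary target, SES is expected to pick a logistic-regression-based
# conditional independence test automatically.
res <- SES(target = target, dataset = dataset,
           max_k = 3, threshold = 0.05)

res@selectedVars   # one representative signature (selected feature indices)
res@signatures     # all statistically equivalent signatures, one per row
```

The key difference from single-subset selectors is the `signatures` slot: rather than one "best" subset, SES reports every subset it found to be statistically equivalent to it.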



Introduction

The process of feature or variable selection aims to identify a subset of features that are relevant with respect to a given task; for example, in regression and classification it is often desirable to select and retain only the subset of variables with the highest predictive power. When a system contains redundant components, measurements taken over them are equivalent to each other, and there is no particular reason to prefer one over another for inclusion in a predictive subset. This situation is especially relevant in biology, where nature uses redundancy to ensure resilience to shocks or adverse events.

