Abstract

We revisit sure independence screening procedures for variable selection in generalized linear models and the Cox proportional hazards model. Through the publicly available R package SIS, we provide a unified environment to carry out variable selection using iterative sure independence screening (ISIS) and all of its variants. For the regularization steps in the ISIS recruiting process, available penalties include the LASSO, SCAD, and MCP, while the implemented variants for the screening steps are sample splitting, data-driven thresholding, and combinations thereof. Performance of these feature selection techniques is investigated on real and simulated data sets, where we find considerable improvements in model selection and computational time for our algorithms relative to traditional penalized pseudo-likelihood methods applied directly to the full set of covariates.
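As an illustration of the package interface described above, the sketch below shows a single call to the SIS package; the argument names (family, penalty, tune, varISIS, seed) follow the CRAN documentation as we understand it, and the simulated data are purely hypothetical.

    ## Minimal sketch: vanilla ISIS with SCAD regularization steps, tuned by BIC.
    ## Data are simulated for illustration only.
    library(SIS)

    set.seed(1)
    n <- 100; p <- 1000
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1] - 2 * x[, 2] + rnorm(n)   # only the first two covariates are active

    fit <- SIS(x, y, family = "gaussian", penalty = "SCAD",
               tune = "bic", varISIS = "vanilla", seed = 1)
    fit$ix                                # indices of the selected covariates

Analogous calls with family = "binomial", "poisson", or "cox" and penalty = "lasso" or "MCP" would cover the other models and penalties mentioned above.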

Highlights

  • With the remarkable development of modern technology, including computing power and storage, more and more high-dimensional and high-throughput data of unprecedented size and complexity are being generated for contemporary statistical studies

  • Through the publicly available R package SIS, we provide a unified environment to carry out variable selection using iterative sure independence screening (ISIS) and all of its variants

  • For the regularization steps in the ISIS recruiting process, available penalties include the LASSO, smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP), while the implemented variants for the screening steps are sample splitting, data-driven thresholding, and combinations thereof. Performance of these feature selection techniques is investigated on real and simulated data sets, where we find considerable improvements in model selection and computational time for our algorithms relative to traditional penalized pseudo-likelihood methods applied directly to the full set of covariates


Summary

Introduction

With the remarkable development of modern technology, including computing power and storage, more and more high-dimensional and high-throughput data of unprecedented size and complexity are being generated for contemporary statistical studies. Fan and Lv (2008) introduced a new framework for variable screening via independent correlation learning that tackles these challenges in the context of ultrahigh-dimensional linear models. Their proposed sure independence screening (SIS) is a two-stage procedure: first, features with weak marginal correlation with the response are filtered out, effectively reducing the dimensionality p to a moderate scale below the sample size n; second, variable selection and parameter estimation are performed simultaneously through a lower-dimensional penalized least squares method such as SCAD or LASSO. Taking advantage of the fast cyclical coordinate descent algorithms developed in the packages glmnet (Friedman et al. 2013) and ncvreg (Breheny 2013), for convex and nonconvex penalty functions, respectively, we are able to efficiently perform the moderate-scale penalized pseudo-likelihood steps of the ISIS procedure, yielding variable selection techniques that outperform direct use of glmnet and ncvreg in terms of both computational time and estimation error.
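To make the two-stage idea concrete, here is a hedged sketch of vanilla SIS for a linear model written against glmnet only; the screening size n/log(n), the simulated data, and the variable names are our own illustrative choices rather than part of the package.

    ## Stage 1: rank covariates by absolute marginal correlation with the
    ## response and keep the top n/log(n) of them.
    ## Stage 2: run a penalized least squares fit (LASSO via glmnet) on the
    ## surviving covariates.
    library(glmnet)

    set.seed(1)
    n <- 200; p <- 5000
    x <- matrix(rnorm(n * p), n, p)
    y <- 3 * x[, 1] - 2 * x[, 10] + rnorm(n)

    d    <- floor(n / log(n))                        # screening size
    keep <- order(abs(cor(x, y)), decreasing = TRUE)[1:d]

    cvfit <- cv.glmnet(x[, keep], y)                 # LASSO with cross-validation
    beta  <- coef(cvfit, s = "lambda.min")
    keep[which(beta[-1] != 0)]                       # selected indices in the original x

The iterative version (ISIS) repeats these two stages, re-screening the remaining covariates conditional on the variables already recruited.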

SIS and feature ranking by maximum marginal likelihood estimators
Inputs
Outputs
Variants of ISIS
First variant of ISIS
Second variant of ISIS
Implementation details
Model selection and timings
Model selection and statistical accuracy
Method
Code example
Discussion
