Abstract

Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms. In the forward stage, SODA adds predictors that have significant overall effects, whereas in the backward stage SODA removes unimportant terms to optimize the extended Bayesian information criterion (EBIC). Compared with existing methods for variable selection in quadratic discriminant analysis, SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size, and it does not require the joint normality assumption on the predictors, leading to much enhanced robustness. We further extend SODA to conduct variable selection and model fitting for general index models. Compared with existing variable selection methods based on sliced inverse regression (SIR), SODA requires neither the linearity nor the constant variance condition and is thus more robust. Our theoretical analysis establishes the variable-selection consistency of SODA under high-dimensional settings, and our simulation studies as well as real-data applications demonstrate the superior performance of SODA in dealing with non-Gaussian design matrices in both logistic and general index models. Supplementary materials for this article are available online.
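The forward-backward search driven by EBIC can be sketched in a few lines of Python. This is a generic illustration of the stepwise-EBIC idea only, not the actual SODA algorithm (which, among other things, also screens interaction terms among the selected variables); all function names here are ours, and EBIC is taken as -2*loglik + |S|*log(n) + 2*gamma*|S|*log(p) for a model using |S| of p candidate terms.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def neg2_loglik(X, y):
        # -2 x log-likelihood of a (nearly) unpenalized logistic fit;
        # with no columns, fall back to the intercept-only model.
        if X.shape[1] == 0:
            p1 = np.clip(y.mean(), 1e-12, 1 - 1e-12)
            return -2 * np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
        prob = LogisticRegression(C=1e6, max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
        prob = np.clip(prob, 1e-12, 1 - 1e-12)
        return -2 * np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

    def ebic(X, y, S, gamma=0.5):
        # EBIC_gamma = -2*loglik + |S|*log(n) + 2*gamma*|S|*log(p)
        n, p = X.shape
        return neg2_loglik(X[:, S], y) + len(S) * np.log(n) + 2 * gamma * len(S) * np.log(p)

    def forward_backward(X, y, gamma=0.5):
        S = []
        # Forward stage: greedily add the term that most reduces EBIC.
        while len(S) < X.shape[1]:
            cand = [(ebic(X, y, S + [j], gamma), j) for j in range(X.shape[1]) if j not in S]
            best, j = min(cand)
            if best >= ebic(X, y, S, gamma):
                break
            S.append(j)
        # Backward stage: drop terms whose removal reduces EBIC further.
        while S:
            cand = [(ebic(X, y, [s for s in S if s != j], gamma), j) for j in S]
            best, j = min(cand)
            if best >= ebic(X, y, S, gamma):
                break
            S.remove(j)
        return S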

Highlights

  • Classification, known as “supervised learning”, is a fundamental building block of statistical machine learning

  • We report in the Supplementary Materials a comparison between SODA and Lasso-logistic regression for variable selection when the underlying logistic regression model has only linear main effects; SODA was competitive with Lasso in all cases we tested and significantly outperformed Lasso when the “incoherence” condition (Ravikumar et al., 2010) was violated

  • We study variable and interaction selection for logistic regression with second-order terms, which covers QDA as a special case; a short derivation of this claim is sketched after this list

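The claim that QDA is a special case of quadratic logistic regression is the standard Gaussian calculation: if X | Y = k ~ N(mu_k, Sigma_k) for k = 0, 1, then the Bayes log-odds are quadratic in x,

    \log\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}
      = -\tfrac{1}{2}\, x^\top (\Sigma_1^{-1} - \Sigma_0^{-1})\, x
        + (\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0)^\top x + c,

where the constant c collects the log prior-odds, the log-determinant ratio, and the mu_k^T Sigma_k^{-1} mu_k terms. Thus the QDA rule is exactly a logistic regression in the linear and second-order terms of x; when Sigma_1 = Sigma_0 the quadratic part vanishes and LDA (linear log-odds) is recovered.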

Summary

Introduction

Classification, known as “supervised learning”, is a fundamental building block of statistical machine learning. We applied LDA, logistic regression, and QDA to train classifiers, and estimated the classification accuracy using 1000 additional testing samples generated from the oracle model. Both LDA and logistic regression with only linear terms had poor prediction power, whereas QDA improved the classification accuracy dramatically. A direct application of Lasso-logistic regression with all second-order terms is computationally prohibitive for moderately large p (e.g., p ≥ 1000). To cope with this difficulty, Fan et al. (2015) proposed innovated interaction screening (IIS), based on transforming the original predictor vector.
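The LDA/logistic/QDA contrast described above is easy to reproduce with a small simulation. The oracle model below is an illustrative quadratic logistic model of our own choosing, not the design used in the paper; as in the paper, accuracy is estimated on 1000 additional testing samples.

    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def simulate(n, p=5):
        # Illustrative oracle: logistic model with an interaction and a
        # pure quadratic term, so the true decision boundary is curved.
        X = rng.standard_normal((n, p))
        logit = X[:, 0] - X[:, 1] + 2.0 * X[:, 0] * X[:, 1] + X[:, 2] ** 2
        y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
        return X, y

    X_tr, y_tr = simulate(500)
    X_te, y_te = simulate(1000)  # 1000 additional testing samples

    for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                      ("logistic (linear terms)", LogisticRegression(max_iter=1000)),
                      ("QDA", QuadraticDiscriminantAnalysis())]:
        print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))

On data like this, the two linear classifiers plateau well below QDA, since only QDA's quadratic boundary can track the interaction and squared terms.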

Method
  Quadratic logistic regression and extended BIC
  Stepwise variable and interaction selection
  Preliminary main effect selection
  Backward elimination
  Post-selection prediction for continuous response
  Implementation issues of SODA
Theoretical properties of SODA
  Logistic regression with interactions
  Continuous-response index models
  Prediction of continuous surface
Real data analysis
  Michigan lung cancer dataset
  Ionosphere dataset
  Pumadyn dataset
Concluding remarks