Abstract

Independence screening is powerful for variable selection when the number of variables is massive. Commonly used independence screening methods are based on marginal correlations or their variants. When some prior knowledge on a certain important set of variables is available, a natural assessment of the relative importance of the other predictors is their conditional contributions to the response given the known set of variables. This results in conditional sure independence screening (CSIS). CSIS produces a rich family of alternative screening methods by different choices of the conditioning set and can help reduce the number of false positive and false negative selections when covariates are highly correlated. This article proposes and studies CSIS in generalized linear models. We give conditions under which sure screening is possible and derive an upper bound on the number of selected variables. We also spell out the situation under which CSIS yields model selection consistency and the properties of CSIS when a data-driven conditioning set is used. Moreover, we provide two data-driven methods to select the thresholding parameter of conditional screening. The utility of the procedure is illustrated by simulation studies and analysis of two real datasets. Supplementary materials for this article are available online.

Highlights

  • Statisticians are nowadays frequently confronted with massive data sets from various frontiers of scientific research

  • A natural assessment of the relative importance of the other predictors is the conditional contribution of each individual predictor in the presence of the known set of variables. This results in conditional sure independence screening (CSIS)

  • Over the last ten years, there have been many exciting developments in statistics and machine learning on variable selection techniques for ultrahigh dimensional feature space


Summary

INTRODUCTION

Statisticians are nowadays frequently confronted with massive data sets from various frontiers of scientific research. Consider the linear model (1) again with sparse regression coefficients $\beta^\star = (10, 0, \cdots, 0, 1)^T$ and equi-correlation 0.9 among all covariates except $X_{2000}$, which is independent of the rest of the covariates. By using the conditional screening approach in which the covariate $X_1$ is conditioned upon (used in the joint fit), the marginal utilities of the spurious variables are significantly reduced. The distributions of the average magnitude of the conditional fitted coefficients $\{|\beta^{M}_{Cj}|\}_{j=2}^{1999}$ and of $|\beta^{M}_{C2000}|$ are shown in the middle panel of Figure 2. As shown by Fan and Lv (2008) and Fan and Song (2010), for a given threshold of marginal utility, the number of selected variables depends on the correlation among covariates, as measured by the largest eigenvalue of $\Sigma$, $\lambda_{\max}(\Sigma)$.
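The example above can be reproduced in miniature. The following is a minimal sketch, not the authors' code: it assumes standard normal covariates and noise, a sample size of n = 400, and a single simulated data set rather than an average over replications. None of these settings are stated in the text. The function names `simulate` and `screening_coefs` are hypothetical. Each covariate is scored by the magnitude of its least-squares coefficient when fitted jointly with an intercept and the conditioning set; an empty conditioning set gives ordinary marginal screening.

```python
import numpy as np


def simulate(n=400, p=2000, rho=0.9, seed=0):
    """Draw (X, y) from the sparse linear model of the example.

    Assumptions (not from the source): Gaussian covariates with unit
    variance, standard normal noise, and n = 400 observations.
    """
    rng = np.random.default_rng(seed)
    # A shared factor induces pairwise correlation rho among covariates.
    common = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * rng.standard_normal((n, p))
    # The last covariate is independent of all the others.
    X[:, -1] = rng.standard_normal(n)
    beta = np.zeros(p)
    beta[0], beta[-1] = 10.0, 1.0
    y = X @ beta + rng.standard_normal(n)
    return X, y


def screening_coefs(X, y, cond_idx):
    """Coefficient of X_j when y is regressed on (1, X_C, X_j), for each j.

    Uses the Frisch-Waugh identity: project the intercept and the
    conditioning columns out of y and of every X_j, then run simple
    regressions on the residuals. Entries for columns in cond_idx are NaN.
    """
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X[:, cond_idx]])

    def residualize(A):
        coef, *_ = np.linalg.lstsq(Z, A, rcond=None)
        return A - Z @ coef

    ry, RX = residualize(y), residualize(X)
    keep = np.ones(p, dtype=bool)
    keep[cond_idx] = False
    coefs = np.full(p, np.nan)
    coefs[keep] = (RX[:, keep].T @ ry) / np.sum(RX[:, keep] ** 2, axis=0)
    return coefs


if __name__ == "__main__":
    X, y = simulate()
    marginal = np.abs(screening_coefs(X, y, []))      # no conditioning
    conditional = np.abs(screening_coefs(X, y, [0]))  # condition on X_1
    print("marginal:    spurious median, X_2000 =",
          np.median(marginal[1:-1]), marginal[-1])
    print("conditional: spurious median, X_2000 =",
          np.median(conditional[1:-1]), conditional[-1])
```

Under these assumptions the population marginal coefficient of each spurious covariate is 10 × 0.9 = 9, while that of the last covariate is only 1, so marginal screening ranks the truly active last covariate below the spurious ones. After conditioning on the first covariate, the population coefficients of the spurious covariates are zero and that of the last covariate stays at 1, which is the reduction in spurious marginal utility the paragraph describes.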

Generalized Linear Models
Conditional Screening
SURE SCREENING PROPERTIES
Properties on Population Level
Properties on Sample Level
SELECTION OF THE THRESHOLDING PARAMETER
Controlling FDR
Random Decoupling
Simulation Study
Normal model
Binomial model
Robustness of CSIS
Leukemia Data
Financial Data
Proof of Theorem 1
Proof of Theorem 2
Proof of Theorem 3
The Fisher information
Proof of Theorem 4
Findings
Proof of Theorem 5
