Abstract

For ultrahigh-dimensional data, variable screening is an important step that reduces the scale of the problem and thereby improves estimation accuracy and efficiency. In this paper, we propose a new dependence measure, the log odds ratio statistic, for use under the sufficient variable screening framework. The sufficient variable screening approach ensures the sufficiency of the selected input features in modeling the regression function and is an enhancement of existing marginal screening methods. In addition, we propose an ensemble variable screening approach that combines the proposed fused log odds ratio filter with the fused Kolmogorov filter to achieve superior performance by taking advantage of both filters. We establish the sure screening properties of the fused log odds ratio filter for both marginal variable screening and sufficient variable screening. Extensive simulations and a real data analysis demonstrate the usefulness of the proposed log odds ratio filter and the sufficient variable screening procedure.

Highlights

  • Ultrahigh-dimensional data have emerged recently in many areas of modern scientific research, including microarray, genomic, proteomic, brain images and genetic data

  • Under the assumption that only a small number of the observed input features, usually referred to as active features, contribute to the response variable, [10] propose the sure independence screening (SIS) method to identify a subset of features that contains the active features

  • We show that the fused log odds ratio filter enjoys sure screening properties for both marginal screening and sufficient variable screening


Summary

Introduction

Ultrahigh-dimensional data have emerged recently in many areas of modern scientific research, including microarray, genomic, proteomic, brain images and genetic data. We show that the proposed log odds ratio statistic can be used for variable screening and that the log odds ratio filter is fully nonparametric and model-free. It is invariant under monotone transformations of the features. The log odds ratio filter can be applied to data in which the response variable and the input features are either discrete or continuous. Because each filter has advantages in different situations, the proposed fused log odds ratio filter can be combined with the fused Kolmogorov filter, as complements to each other, to achieve better performance under an ensemble approach. Additional remarks and technical proofs are included in the appendix.
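To make the idea of marginal screening with a log-odds-ratio-type statistic concrete, the sketch below ranks features by the absolute log odds ratio of a 2x2 table. It is an illustration only and not the statistic defined in the paper: the median split of both the feature and the response, the 0.5 continuity correction, the function names `log_odds_ratio` and `screen`, and the simulated data are all assumptions made here. Because the score depends on each feature only through its ordering, it also shows the kind of invariance under monotone transformations mentioned above.

```python
import numpy as np

def log_odds_ratio(x, y):
    """Absolute log odds ratio of the 2x2 table from median splits of x and y.

    Assumption: median dichotomization and a 0.5 continuity correction;
    this is an illustrative choice, not the paper's definition.
    """
    a = x > np.median(x)
    b = y > np.median(y)
    n11 = np.sum(a & b) + 0.5
    n10 = np.sum(a & ~b) + 0.5
    n01 = np.sum(~a & b) + 0.5
    n00 = np.sum(~a & ~b) + 0.5
    return abs(np.log(n11 * n00 / (n10 * n01)))

def screen(X, y, d):
    """Score every column of X marginally and keep the d highest-scoring ones."""
    scores = np.array([log_odds_ratio(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(-scores)[:d]
    return keep, scores

# Hypothetical data: 1000 features, only the first two are active.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

keep, scores = screen(X, y, d=20)

# A strictly increasing transformation of the features leaves the
# median splits, and therefore the scores, unchanged.
scores_exp = screen(np.exp(X), y, d=20)[1]
```

Here `keep` holds the indices of the retained features. A fused version in the spirit of the paper would presumably combine scores over several slicing schemes of the response rather than a single median split, but that construction is not given in this summary and is not attempted here.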

Framework
Algorithms
Ensemble
Motivation
Proposed methodology
The fused log odds ratio filter for sufficient variable screening
Regularity conditions for marginal screening
Sure screening property for marginal screening
Regularity conditions for sufficient screening
Sure screening property for sufficient screening
Simulations
Real data example
Findings
Discussions
