Abstract
Feature selection, reproducibility, and model selection are of fundamental importance in contemporary statistics. Feature selection methods are required in a wide range of applications to evaluate the significance of covariates, while reproducibility of the selected features is needed to claim that findings are meaningful and interpretable. Finally, model selection is employed to pinpoint the best set of covariates among a sequence of candidate models produced by feature selection methods.

We show that p-values, a common tool for feature selection, behave differently in nonlinear models and can break down earlier, at lower dimensionality, than their linear counterparts. Next, we provide important theoretical foundations for model-X knockoffs, a recent state-of-the-art method for reproducible feature selection, and establish power and robustness results for the procedure. Finally, we tackle the large-scale model selection problem for misspecified models, proposing a novel information criterion tailored to both model misspecification and high dimensionality.