Abstract

We propose a family of tests to assess the goodness of fit of a high dimensional generalized linear model. Our framework is flexible: it may be used to construct an omnibus test, a test directed at specific non-linearities and interaction effects, or a test of the significance of groups of variables. The methodology is based on extracting left-over signal in the residuals from an initial fit of a generalized linear model. This is achieved by predicting the signal from the residuals using modern powerful regression or machine learning methods such as random forests or boosted trees. Under the null hypothesis that the generalized linear model is correct, no signal is left in the residuals and our test statistic has a Gaussian limiting distribution, which translates to asymptotic control of the type I error. Under a local alternative, we establish a guarantee on the power of the test. We illustrate the effectiveness of the methodology on simulated and real data examples by testing goodness of fit in logistic regression models. Software implementing the methodology is available in the R package GRPtests.
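
The idea of predicting left-over signal from residuals can be illustrated with a small, low dimensional sketch. The code below is not the GRPtests procedure and makes several simplifying assumptions of its own: the logistic model is fitted by plain gradient ascent rather than the lasso, a k-nearest-neighbour average stands in for a random forest, the sample is split in half, and the statistic ignores the correction for estimation error that the paper's theory handles. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_predictor(beta, x):
    # beta[0] is the intercept, beta[1:] are the slopes.
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def fit_logistic(X, y, steps=500, lr=1.0):
    # Unpenalized logistic regression by gradient ascent on the mean
    # log-likelihood (an assumption: the paper uses a lasso fit instead).
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(linear_predictor(beta, xi))
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def pearson_residuals(X, y, beta):
    # (y - p) / sqrt(p (1 - p)) for each observation.
    res = []
    for xi, yi in zip(X, y):
        p = sigmoid(linear_predictor(beta, xi))
        res.append((yi - p) / math.sqrt(p * (1.0 - p)))
    return res


def knn_predict(X_train, r_train, x, k=20):
    # Average residual of the k nearest training points: a simple
    # stand-in for a random forest or boosted trees.
    order = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    return sum(r_train[i] for i in order[:k]) / k


def residual_prediction_test(X, y, k=20):
    # Part A learns a direction of left-over signal; part B tests it.
    half = len(X) // 2
    XA, yA, XB, yB = X[:half], y[:half], X[half:], y[half:]
    rA = pearson_residuals(XA, yA, fit_logistic(XA, yA))
    rB = pearson_residuals(XB, yB, fit_logistic(XB, yB))
    w = [knn_predict(XA, rA, x, k) for x in XB]
    stat = sum(wi * ri for wi, ri in zip(w, rB)) / math.sqrt(sum(wi * wi for wi in w))
    # One-sided p-value from the standard normal upper tail.
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))
    return stat, p_value


# Simulated example: the true model has a quadratic term in x1 that the
# fitted (linear) logistic model omits.
random.seed(0)
n = 400
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y = [
    1 if random.random() < sigmoid(0.5 * x1 - 0.5 * x2 + (x1 * x1 - 1.0)) else 0
    for x1, x2 in X
]
stat, p_value = residual_prediction_test(X, y)
print("statistic:", round(stat, 3), "p-value:", round(p_value, 4))
```

A large positive statistic indicates that the direction learned on the first half of the data still correlates with the residuals on the second half, which is evidence against the fitted model. Splitting the sample keeps the learned direction independent of the residuals it is tested against.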

Highlights

  • In recent years, there has been substantial progress in developing methodology for estimation in generalized linear models (GLMs) in high dimensional settings, where the number of covariates in the model may be much larger than the number of observations.

  • A standard technique for estimation is the lasso for GLMs (Park and Hastie, 2007), which has a fast implementation in the R package glmnet (Friedman et al., 2010) and is widely used.

  • Once a GLM has been fitted to the high dimensional data, it is important to assess the quality of the fit.


Introduction

There has been substantial progress in developing methodology for estimation in generalized linear models (GLMs) in high dimensional settings, where the number of covariates in the model may be much larger than the number of observations. A standard technique for estimation is the lasso for GLMs (Park and Hastie, 2007), which has a fast implementation in the R package glmnet (Friedman et al., 2010) and is widely used. The lasso enjoys good empirical and theoretical properties for estimation and variable selection, provided that we are searching for a sparse approximation to the regression coefficients in the GLM. Once a GLM has been fitted to the high dimensional data, it is important to assess the quality of the fit. The literature on testing goodness of fit in low dimensional settings is extensive: we refer to Section 1.2 below for an overview. The methods typically rely on properties that hold only in low dimensional settings such as asymptotic linearity and normality of the maximum
