Abstract

The research on statistical inference after data-driven model selection can be traced back as far as Koopmans (1949). Intensive research on modern model selection methods for high-dimensional data over the past three decades has revived interest in statistical inference after model selection. In recent years there has been a surge of articles on this topic, and a rather vast literature now exists. This manuscript aims to present a holistic review of post-model-selection inference in linear regression models, while also incorporating perspectives from high-dimensional inference in these models. We first give a simulated example motivating the necessity of valid statistical inference after model selection. We then provide theoretical insights explaining the phenomena observed in the example, through a literature survey on the post-selection sampling distribution of regression parameter estimators and on the coverage probabilities of naïve confidence intervals. Categorized according to two types of estimation targets, namely the population-based and projection-based regression coefficients, we review recent uncertainty assessment methods and discuss possible pros and cons of the confidence intervals constructed by different methods.

Highlights

  • Classical inference relies on one key assumption about model specification: a correct model that accurately characterizes the true data-generating mechanism is known, up to certain parameter values, before valid parameter estimation and uncertainty assessment are carried out

  • Both EPoSI1 and EPoSI2 can have infinite length, depending on the type of the design matrix

  • EPoSI1 is more likely to be of finite length than EPoSI2


Summary

Introduction

Classical inference relies on one key assumption about model specification: a correct model that accurately characterizes the true data-generating mechanism is known, up to certain parameter values, before valid parameter estimation and uncertainty assessment are carried out. In view of the foregoing reasoning and the perspectives of Ioannidis (2005) and Benjamini (2020) on the replicability of results, and given that the presumed condition of Fisher’s likelihood method is violated when one data set is used simultaneously for model selection and statistical inference, one naturally asks what detrimental effects this violation has on parameter estimation and precision assessment, such that the resulting unadjusted measures, e.g., p-values or confidence intervals, represent a source of concern across scientific communities. To answer this question, several important contributions have been made over the past two decades, including Pötscher (1991, 1995); Pötscher & Novák (1998); Leeb & Pötscher (2003, 2005, 2006, 2008); Kabaila (1995, 1998, 2005, 2009); Kabaila & Leeb (2006); Kabaila & Giri (2009); and Berk et al. (2009), among others. An R package, hdi (high-dimensional inference), was also introduced.
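The detrimental effect in question can be illustrated with a minimal Monte Carlo sketch (this is an illustrative example, not the manuscript's actual simulation): in a single-regressor linear model with a known noise variance and a deliberately "borderline" coefficient of one standard error in size, the regressor is retained only when its z-statistic is significant, and the naïve 95% confidence interval is then computed as if no selection had occurred. All numerical choices below (sample size, coefficient value, fixed design) are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: y = beta * x + noise with known unit noise variance.
# beta is set to exactly one standard error, the regime where selection
# effects on the naive interval are most pronounced.
n, n_rep, z = 25, 100_000, 1.96
x = np.ones(n)                       # fixed design; sum(x**2) = n
se = 1.0 / np.sqrt(n)                # known standard error of the OLS estimate
beta = se                            # true coefficient = 1 standard error

eps = rng.standard_normal((n_rep, n))
y = beta * x + eps                   # n_rep independent data sets
beta_hat = y @ x / (x @ x)           # OLS estimate in each replication

sel = np.abs(beta_hat / se) > z      # keep x only if it is "significant"
cover = np.abs(beta_hat[sel] - beta) <= z * se  # naive 95% CI covers beta?

print(f"selected in {sel.mean():.1%} of replications")
print(f"conditional coverage of naive 95% CI: {cover.mean():.1%}")
```

Under this setup the regressor is selected in roughly 17% of replications, and among those the naïve interval covers the true coefficient only about 84% of the time, well below the nominal 95%: precisely the kind of undercoverage of naïve confidence intervals that the surveyed literature analyzes.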
