Research on statistical inference after data-driven model selection can be traced as far back as Koopmans (1949). Intensive research over the past three decades on modern model selection methods for high-dimensional data has revived interest in statistical inference after model selection. In recent years, there has been a surge of articles on the topic, and a rather vast literature now exists. Our manuscript aims to present a holistic review of post-model-selection inference in linear regression models, while also incorporating perspectives from high-dimensional inference in these models. We first give a simulated example that motivates the need for valid statistical inference after model selection. We then provide theoretical insights that explain the phenomena observed in the example, through a literature survey on the post-selection sampling distribution of regression parameter estimators and the properties of coverage probabilities of naïve confidence intervals. We review recent uncertainty assessment methods, categorized according to two types of estimation targets: the population-based and the projection-based regression coefficients. We also discuss the advantages and drawbacks of the confidence intervals constructed by the different methods.