Abstract

The concept of the P-value was proposed by Fisher to measure the inconsistency of data with a specified null hypothesis, and it plays a central role in statistical inference. In classical linear regression analysis, it is standard procedure to calculate P-values for regression coefficients based on the least squares estimator (LSE) in order to determine their significance. However, for high dimensional data, where the number of predictors exceeds the sample size, ordinary least squares is no longer applicable and there is no valid definition of P-values based on the LSE. It is also challenging to define sensible P-values for other high dimensional regression methods, such as penalization and resampling methods. In this paper, we introduce a new concept, called the oracle P-value, which generalizes traditional LSE-based P-values to high dimensional sparse regression models. We then propose several estimation procedures to approximate oracle P-values for real data analysis. We show that the oracle P-value framework is useful for developing new and powerful tools that enhance high dimensional data analysis, including variable ranking, variable selection, and screening procedures with false discovery rate (FDR) control. Numerical examples are presented to demonstrate the performance of the proposed methods.
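To make the classical setting concrete, the following is a minimal sketch, not taken from the paper, of the LSE-based P-values described above: two-sided t-tests for the coefficients of an ordinary least squares fit. The helper name ols_pvalues and the toy data are illustrative assumptions; the computation requires n > p and is undefined in the high dimensional regime that the oracle P-value is designed to handle.

    import numpy as np
    from scipy import stats

    def ols_pvalues(X, y):
        """Two-sided t-test P-values for OLS coefficients (illustrative sketch).

        Valid only when n > p so that X'X is invertible; for p >= n the least
        squares estimator is not unique and these P-values are undefined.
        """
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y                 # least squares estimate
        resid = y - X @ beta_hat
        sigma2_hat = resid @ resid / (n - p)         # unbiased error variance estimate
        se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # coefficient standard errors
        t_stats = beta_hat / se
        return 2 * stats.t.sf(np.abs(t_stats), df=n - p)

    # Toy example: n = 100, p = 5, only the first two predictors are active.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    y = X @ np.array([2.0, -1.5, 0.0, 0.0, 0.0]) + rng.standard_normal(100)
    print(ols_pvalues(X, y))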

Highlights

  • Many contemporary data sets feature high dimensionality

  • We introduce the concept of an oracle P-value for high dimensional sparse linear models

  • We propose a false discovery rate (FDR) approach based on oracle P-values for high dimensional sparse linear regression

Introduction

Many contemporary data sets feature high dimensionality. Given a set of observations {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^p is a predictor and y_i is a response, the dimension p is often comparable to or much larger than the sample size n. Several recent proposals define P-values for high dimensional linear regression, including the screen-and-clean approach (Wasserman & Roeder, 2009), the multi-split approach (Meinshausen et al., 2009), the low dimensional projection (LDP) approach (Zhang & Zhang, 2014; Buhlmann et al., 2013), and recent work on hypothesis tests for generic penalized M-estimators (Ning & Liu, 2016). The LDP approach constructs confidence intervals for regression coefficients in high dimensional settings; it addresses a related question rather than directly defining valid P-values. Another seminal work is Fan et al. (2012b), which discusses false discovery proportion (FDP) control based on marginal regression models instead of joint linear regression models. We illustrate how the oracle P-value can be used to enhance variable ranking and screening with FDR control, which makes it a valuable tool for high dimensional modeling and inference.
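As a point of reference for the FDR-controlled screening discussed above, the sketch below applies the standard Benjamini-Hochberg step-up rule to a vector of P-values, such as estimated oracle P-values. This is the generic BH procedure at a nominal level q, not necessarily the exact selection rule developed in the paper, and the P-value array is purely illustrative.

    import numpy as np

    def benjamini_hochberg(pvalues, q=0.10):
        """Benjamini-Hochberg step-up procedure: select the variables whose sorted
        P-value p_(k) satisfies p_(k) <= k * q / m, which controls the false
        discovery rate at level q for independent P-values."""
        p = np.asarray(pvalues)
        m = len(p)
        order = np.argsort(p)                        # ranks from smallest to largest
        thresholds = q * np.arange(1, m + 1) / m     # BH threshold for each rank
        below = p[order] <= thresholds
        if not below.any():
            return np.array([], dtype=int)           # nothing selected
        k = np.max(np.nonzero(below)[0])             # largest rank passing the threshold
        return np.sort(order[: k + 1])               # indices of selected variables

    # Example: feed in (estimated oracle) P-values for six predictors.
    pvals = [0.001, 0.30, 0.004, 0.75, 0.02, 0.64]
    print(benjamini_hochberg(pvals, q=0.10))         # indices of selected predictors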

Oracle P-value
Mimicking oracle P-values
Variable screening
Variable ranking
Numerical results
Distribution of oracle P-values
Variable selection under FDR control
Real data analysis
Discussion