Abstract

Influence diagnosis is an integrated component of data analysis but has been severely underinvestigated in a high dimensional regression setting. One of the key challenges, even in a fixed dimensional setting, is how to deal with multiple influential points that give rise to masking and swamping effects. The paper proposes a novel group deletion procedure referred to as multiple influential point detection by studying two extreme statistics based on a marginal-correlation-based influence measure. Named the min- and max-statistics, they have complementary properties in that the max-statistic is effective for overcoming the masking effect whereas the min-statistic is useful for overcoming the swamping effect. Combining their strengths, we further propose an efficient algorithm that can detect influential points with a prespecified false discovery rate. The influential point detection procedure proposed is simple to implement and efficient to run and enjoys attractive theoretical properties. Its effectiveness is verified empirically via extensive simulation study and data analysis. An R package implementing the procedure is freely available.

Highlights

  • Recent decades have witnessed an explosion of high dimensional data in applied fields including biology, engineering, finance and many other areas

  • With a prespecified false discovery rate (FDR) of 0.05, using the min-statistic, we identify a set of seven influential observations, represented as the full circles in Figs 1(a) and 1(b)

  • We show in theorem 1 that, surprisingly, when there is no influential point, these two statistics both follow a χ²(1) distribution
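To make the last two highlights concrete, here is a minimal Python sketch of how statistics with a χ²(1) null distribution could be turned into p-values and screened at an FDR of 0.05. It is not the paper's algorithm: the standard Benjamini-Hochberg step-up rule is swapped in as a stand-in, and the function names and the `stats` values are hypothetical, chosen only for illustration.

```python
import math

def chi2_1_sf(x):
    """Upper-tail probability P(chi2(1) > x), using P(|Z| > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

def benjamini_hochberg(p_values, fdr=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg step-up rule."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k whose ordered p-value satisfies p_(k) <= k * fdr / n
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / n:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical per-observation statistics (illustrative values, not from the paper):
# two large values among otherwise unremarkable ones.
stats = [0.3, 12.5, 1.1, 0.02, 15.0, 0.8, 2.0, 0.5]
p_values = [chi2_1_sf(s) for s in stats]
flagged = benjamini_hochberg(p_values, fdr=0.05)
print(flagged)  # prints [1, 4]
```

The tail probability uses only the standard library, since a χ²(1) variable is the square of a standard normal one; the step-up rule then keeps every observation up to the largest rank that passes its threshold.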

Introduction

Recent decades have witnessed an explosion of high dimensional data in applied fields including biology, engineering, finance and many other areas. Given a data set consisting of {X_i, Y_i}_{i=1}^n, where Y_i ∈ R is the response and X_i ∈ R^p is the covariate for the ith observation, the main interest is often to conduct a regression analysis to relate Y to X, the simplest model for which takes the linear form. A usual assumption in linear regression is that the observations are all generated from the same model. The data that are collected often contain contaminated or noisy observations due to a plethora of reasons. Those observations exerting great influence on statistical analysis, named influential points, can seriously distort all aspects of data analysis such as altering the estimate of the regression coefficient and swaying the outcome of
