Abstract

This paper considers regularizing a covariance matrix of p variables estimated from n observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and (log p)/n → 0, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.

1. Introduction.

Estimation of covariance matrices is important in a number of areas of statistical analysis, including dimension reduction by principal component analysis (PCA), classification by linear or quadratic discriminant analysis (LDA and QDA), establishing independence and conditional independence relations in the context of graphical models, and setting confidence intervals on linear functions of the means of the components. In recent years, many application areas where these tools are used have been dealing with very high-dimensional datasets, and sample sizes can be very small relative to dimension. Examples include genetic data, brain imaging, spectroscopic imaging, climate data and many others. It is well known by now that the empirical covariance matrix for samples of size n from a p-variate Gaussian distribution, N_p(μ, Σ_p), is not a good estimator of the population covariance if p is large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [29] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 23, 30] and associated eigenvectors [24].
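As a concrete illustration, hard thresholding of a sample covariance matrix can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the function name, the fixed threshold value, and the simulated data below are our own illustrative choices (the paper's theory ties the threshold to a quantity of order sqrt((log p)/n) and selects it in practice by resampling).

```python
import numpy as np

def hard_threshold_cov(X, t):
    """Hard-threshold the sample covariance of an n-by-p data matrix X.

    Off-diagonal entries of the sample covariance whose absolute value
    is below t are set to zero; the diagonal is left untouched.
    Illustrative sketch only -- name and interface are not from the paper.
    """
    S = np.cov(X, rowvar=False)           # p x p sample covariance
    T = np.where(np.abs(S) >= t, S, 0.0)  # zero out small entries
    np.fill_diagonal(T, np.diag(S))       # always keep the diagonal
    return T

# Toy example: n = 200 observations of p = 10 independent variables,
# so the true covariance is the identity (maximally sparse off-diagonal).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
T = hard_threshold_cov(X, t=0.2)  # threshold value chosen ad hoc here
```

With independent columns, the off-diagonal sample covariances are small, so most are zeroed at this threshold and the estimate is close to the sparse truth.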
However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance matrix. Alternative estimators for large covariance matrices have therefore attracted a lot of attention recently. Two broad classes of covariance estimators have emerged: those that rely on a natural ordering among variables, and assume that variables
