In this study, we consider the problem of variable selection and estimation in high-dimensional linear regression models when the complete data are not accessible and only certain marginal information or summary statistics are available. This problem is motivated by genome-wide association studies (GWAS), which have been widely used to identify risk variants underlying complex human traits and diseases. With a large number of completed GWAS, statistical methods that use summary statistics are becoming increasingly important because access to individual-level data sets is restricted. Methods with theoretical guarantees are in high demand to advance statistical inference with the large amount of available marginal information. Here we propose an $\ell_1$-penalized approach, REMI, to estimate high-dimensional regression coefficients with marginal information and external reference samples. We establish an upper bound on the error of the REMI estimator, which has the same order as the minimax error bound of the Lasso with complete individual-level data. In particular, when marginal information is obtained from a large number of samples together with a small number of reference samples, REMI yields good estimation and prediction results, and outperforms the Lasso because the sample size of individual-level data accessible to the Lasso can be limited. Through simulation studies and real data analysis of the NFBC1966 GWAS data set, we demonstrate that REMI is widely applicable. The developed R package and the code to reproduce all the results are available at this https URL.