Although data are increasingly accessible, sample sizes remain limited in most studies, so integrating datasets from multiple sources is desirable for improving model performance. High dependence among covariates is a common characteristic of high-dimensional data, and traditional approaches fail to achieve variable selection consistency when the covariates are highly correlated. To address the issues stemming from this correlation, we propose a penalized factor-adjusted approach that reduces the correlations in integrative analysis of multi-source high-dimensional data within a generalized linear model (GLM) framework. By using the latent factors and idiosyncratic components as predictors, the proposed approach enables model estimation with weakly correlated covariates. We rigorously establish its consistency properties under certain conditions. Simulations demonstrate the superior or competitive performance of the proposed approach in estimation, prediction, and variable selection. An analysis of genetic data on prostate cancer confirms its practical usefulness.