Abstract
Heterogeneity is unwanted variation that arises when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify these methods. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the batch effects and to enhance the inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasive assumption that the latent heterogeneity factors simultaneously affect a fraction of observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the 'blessing of dimensionality'. As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.
Highlights
Aggregating and analyzing heterogeneous data is one of the most fundamental challenges in scientific data analysis
We model the heterogeneity by a semiparametric factor model
We introduce the ALPHA framework for heterogeneity adjustment
Summary
Aggregating and analyzing heterogeneous data is one of the most fundamental challenges in scientific data analysis. To properly analyze data aggregated from multiple sources, we need to carefully model and adjust the heterogeneity effect. A gap still exists between practice and theory. To bridge this gap, we propose a generic theoretical framework to model, estimate, and adjust heterogeneity across multiple datasets. We denote Ui = Xit − ΛiFi to be the heterogeneity-adjusted signal, which can be treated as homogeneous across different datasets and can be combined for downstream statistical analysis. The idea of covariate-adjusted precision matrix estimation has been studied by Cai et al. (2012), but the factor model they used assumes observed factors and no heterogeneity issue, i.e., m = 1.
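The adjust-then-aggregate idea can be illustrated with a minimal sketch. It is not the paper's adaptive procedure: it assumes each dataset is a p × n matrix (variables by samples), that the number of factors per dataset is known, and that the low-rank heterogeneity term is estimated by plain PCA. The names `adjust_heterogeneity` and `pool_residuals` are hypothetical, and the data are simulated.

```python
import numpy as np

def adjust_heterogeneity(X, k):
    """Subtract a rank-k factor component from one p x n data matrix.

    Assumption: plain PCA with a known k stands in for the factor estimate.
    """
    X = X - X.mean(axis=1, keepdims=True)           # center each variable
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]       # estimated loadings times factors
    return X - low_rank                             # heterogeneity-adjusted residual

def pool_residuals(datasets, ks):
    """Adjust every dataset separately, then stack residuals along the sample axis."""
    residuals = [adjust_heterogeneity(X, k) for X, k in zip(datasets, ks)]
    return np.hstack(residuals)

# Simulated example: two sources sharing p variables, each with its own factors.
rng = np.random.default_rng(0)
p = 20
datasets = []
for n, k in [(50, 2), (80, 3)]:
    loadings = rng.normal(size=(p, k))
    factors = rng.normal(size=(k, n))
    noise = rng.normal(scale=0.5, size=(p, n))
    datasets.append(loadings @ factors + noise)

pooled = pool_residuals(datasets, ks=[2, 3])
sample_cov = pooled @ pooled.T / pooled.shape[1]    # pooled residual covariance
```

The pooled covariance could then be passed to a sparse precision matrix estimator for graphical model inference; that step is omitted here.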