Abstract
We consider the problem of learning a conditional Gaussian graphical model in the presence of latent variables. Building on recent advances in this field, we suggest a method that decomposes the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this estimator and show that it is well-behaved in the high-dimensional regime as well as “sparsistent” (i.e., capable of recovering the graph structure). We then show how proximal gradient algorithms and semi-definite programming techniques can be employed to fit the model to thousands of variables. Through extensive simulations, we illustrate the conditions required for identifiability and show that there is a wide range of situations in which this model performs significantly better than its counterparts, for example, by accommodating more latent variables. Finally, the suggested method is applied to two datasets comprising individual-level data on genetic variants and metabolite levels. We show that our results replicate better than those of alternative approaches and exhibit enriched biological signal. Supplementary materials for this article are available online.
Highlights
The task of performing graphical model selection arises in many applications in science and engineering
We study the properties of the proposed model on synthetic data and compare its performance to three other methods introduced earlier: the graphical lasso (GLASSO) (Friedman, Hastie, and Tibshirani 2008), the sparse conditional Gaussian graphical model (SCGGM) (Sohn and Kim 2012; Zhang and Kim 2014; Wytock and Kolter 2013), and the low-rank plus sparse decomposition (LR+S) (Chandrasekaran, Parrilo, and Willsky 2012)
By analogy to the Area Under the Curve (AUC) metric, we report the “volume under the surface” (VUS) which accounts for the effect of both regularization parameters
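The VUS described above can be computed by sweeping one regularization parameter to trace an ROC curve, computing its AUC, and then integrating over the grid of the second parameter. A minimal sketch of that idea, assuming precomputed FPR/TPR grids (one row per value of the second parameter); the function names and the simple averaging over the second parameter are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

def auc(fpr, tpr):
    """Area under one ROC curve, via the trapezoidal rule.
    The curve is traced by varying the first regularization parameter."""
    order = np.argsort(fpr)  # ensure FPR is increasing before integrating
    return np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order])

def vus(fpr_grid, tpr_grid):
    """'Volume under the surface': average the per-curve AUCs over the
    grid of the second regularization parameter (illustrative definition)."""
    aucs = [auc(f, t) for f, t in zip(fpr_grid, tpr_grid)]
    return float(np.mean(aucs))
```

For example, a method that achieves perfect recovery at some threshold for every value of the second parameter yields a VUS of 1, while a random guess yields 0.5, mirroring the interpretation of AUC.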
Summary
The task of performing graphical model selection arises in many applications in science and engineering. It is common that only a subset of the relevant variables is observed, and estimators that do not account for hidden variables are prone to confounding. On the other hand, modeling latent variables is itself difficult because of identifiability and tractability issues. The number of variables being modeled is often greater than the number of samples. It is well known that, in such a scaling regime, obtaining a consistent estimator is usually impossible without making further assumptions about the model, for example, sparsity or low-dimensionality. Modeling the joint distribution over all observed variables is not always relevant. It is sometimes preferable to learn a graphical model over a number of variables of interest while conditioning on the rest of the collection.
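The core decomposition idea above, splitting the parameters into a sparse part (the graph) plus a low-rank part (the latent-variable effect), is typically fitted with proximal operators: entrywise soft-thresholding for the ℓ1 penalty and singular-value thresholding for the nuclear-norm penalty. A minimal sketch on a least-squares surrogate rather than the paper's penalized conditional likelihood; the objective, step size, and function names are illustrative assumptions:

```python
import numpy as np

def soft_threshold(A, tau):
    """Prox of tau*||.||_1: entrywise soft-thresholding (encourages sparsity)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svd_threshold(A, tau):
    """Prox of tau*||.||_*: soft-threshold the singular values (encourages low rank)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def sparse_plus_lowrank(M, lam, gamma, step=0.5, n_iter=200):
    """Alternating proximal gradient steps for the toy surrogate
        min_{S,L} 0.5*||M - S - L||_F^2 + lam*||S||_1 + gamma*||L||_*
    standing in for the penalized likelihood used in the paper."""
    S = np.zeros_like(M)
    L = np.zeros_like(M)
    for _ in range(n_iter):
        R = S + L - M                                   # gradient of the smooth term in S
        S = soft_threshold(S - step * R, step * lam)    # prox step on the sparse part
        R = S + L - M                                   # gradient of the smooth term in L
        L = svd_threshold(L - step * R, step * gamma)   # prox step on the low-rank part
    return S, L
```

Each prox step costs at most one SVD, which is what makes this family of methods scale to thousands of variables, as noted in the abstract.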