Abstract

Background: Mixed graphical models (MGMs) are graphical models learned over a combination of continuous and discrete variables. Mixed variable types are common in biomedical datasets. MGMs consist of a parameterized joint probability density, which implies a network structure over these heterogeneous variables. The network structure reveals direct associations between the variables, and the joint probability density allows one to ask arbitrary probabilistic questions on the data. This information can be used for feature selection, classification and other important tasks.

Results: We studied the properties of MGM learning and applications of MGMs to high-dimensional data (biological and simulated). Our results show that MGMs reliably uncover the underlying graph structure, and when used for classification, their performance is comparable to popular discriminative methods (lasso regression and support vector machines). We also show that imposing separate sparsity penalties for edges connecting different types of variables significantly improves edge recovery performance. To choose these sparsity parameters, we propose a new efficient model selection method, named Stable Edge-specific Penalty Selection (StEPS). StEPS is an expansion of an earlier method, StARS, to mixed variable types. In terms of edge recovery, StEPS-selected MGMs outperform models selected using standard techniques, including AIC, BIC and cross-validation. In addition, we use a heuristic search that is linear in the size of the sparsity value search space, as opposed to the cubic grid search required by other model selection methods. We applied our method to clinical and mRNA expression data from the Lung Genomics Research Consortium (LGRC), and the learned MGM correctly recovered connections between the diagnosis of obstructive or interstitial lung disease, two diagnostic breathing tests, and cigarette smoking history. Our model also suggested biologically relevant mRNA markers that are linked to these three clinical variables.

Conclusions: MGMs are able to accurately recover dependencies between sets of continuous and discrete variables in both simulated and biomedical datasets. Separation of sparsity penalties by edge type is essential for accurate network edge recovery. Furthermore, our stability-based method for model selection determines sparsity parameters faster and more accurately (in terms of edge recovery) than other model selection methods. With the ongoing availability of comprehensive clinical and biomedical datasets, MGMs are expected to become a valuable tool for investigating disease mechanisms and answering an array of critical healthcare questions.
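The idea of choosing one sparsity penalty per edge type from edge stability can be illustrated with a short sketch. The code below is not the authors' implementation. It assumes that some MGM fitting routine (not shown) has already been run on several subsamples of the data at each value of a single shared penalty, and that each fit is summarized as a set of edges. It then scores each edge type (cc, cd, dd) with the StARS-style instability 2f(1-f), where f is the fraction of subsamples containing an edge, and keeps, per edge type, the smallest penalty reached before instability first exceeds a threshold. The function names, the data layout, and the threshold of 0.05 are assumptions made for illustration. Reading the three penalties off one shared path is one plausible way a search could be linear rather than cubic in the number of candidate values.

```python
from itertools import combinations


def edge_type(i, j, is_discrete):
    """Label an edge 'cc', 'cd' or 'dd' from the types of its two endpoints."""
    n_disc = int(is_discrete[i]) + int(is_discrete[j])
    return ("cc", "cd", "dd")[n_disc]


def instability_by_type(edge_sets, is_discrete):
    """Average StARS-style instability 2*f*(1-f) over the edges of each type.

    edge_sets: one set of (i, j) pairs with i < j per subsample (hypothetical layout).
    is_discrete: list of booleans, one per variable.
    """
    p = len(is_discrete)
    n_sub = len(edge_sets)
    totals = {"cc": 0.0, "cd": 0.0, "dd": 0.0}
    counts = {"cc": 0, "cd": 0, "dd": 0}
    for i, j in combinations(range(p), 2):
        # f: fraction of subsamples in which this edge was selected
        freq = sum((i, j) in edges for edges in edge_sets) / n_sub
        kind = edge_type(i, j, is_discrete)
        totals[kind] += 2.0 * freq * (1.0 - freq)
        counts[kind] += 1
    return {k: (totals[k] / counts[k] if counts[k] else 0.0) for k in totals}


def select_penalties(path, is_discrete, threshold=0.05):
    """Pick one penalty per edge type from a single shared-penalty path.

    path: dict mapping penalty value -> list of subsample edge sets.
    For each edge type, walk from the largest penalty downwards and keep the
    smallest penalty seen before instability first exceeds the threshold.
    If even the largest penalty is unstable, that largest penalty is returned.
    """
    lambdas = sorted(path, reverse=True)
    scores = {lam: instability_by_type(path[lam], is_discrete) for lam in lambdas}
    chosen = {}
    for kind in ("cc", "cd", "dd"):
        chosen[kind] = lambdas[0]
        for lam in lambdas:
            if scores[lam][kind] > threshold:
                break
            chosen[kind] = lam
    return chosen


# Toy input: variables 0 and 1 are continuous, 2 and 3 are discrete.
toy_is_discrete = [False, False, True, True]
toy_path = {
    0.4: [set(), set(), set(), set()],
    0.2: [{(0, 1), (2, 3)}, {(0, 1), (2, 3)}, {(0, 1)}, {(0, 1)}],
    0.1: [{(0, 1), (2, 3), (0, 2)}, {(0, 1), (2, 3), (0, 2)},
          {(0, 1), (2, 3)}, {(0, 1), (2, 3)}],
}
print(select_penalties(toy_path, toy_is_discrete))
# prints {'cc': 0.1, 'cd': 0.2, 'dd': 0.4}
```

In this toy input the continuous-continuous edge is selected consistently at every penalty, so the smallest penalty is kept for it, while the discrete-discrete edge flickers between subsamples at 0.2 and therefore receives the larger penalty. Only three groups of fits are needed, rather than one group for every combination of three penalty values.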

Highlights

  • Mixed graphical models (MGMs) are graphical models learned over a combination of continuous and discrete variables

  • We first present an example of this behavior on a single dataset of 500 samples over 50 four-level discrete variables and 50 continuous variables generated from a scale free network structure

  • The sparse conditional Gaussian graphical model (SCGGM) had similar performance on continuous-continuous (cc) edge recovery with both datasets, but significantly worse performance on continuous-discrete (cd) edge recovery in the data with non-linear cd interactions, which resulted in worse overall performance in that setting. For these tests we found that, when allowed to select edge-type-specific sparsity penalties, SCGGM chose the same penalty for the cc and cd edges in 36 out of the 40 datasets, and stable edge-specific penalty selection (StEPS) chose the same penalty for the cc and cd edges in 38 out of the 40 datasets, but a different discrete-discrete (dd) penalty in all 40 cases

Summary

Introduction

Mixed graphical models (MGMs) are graphical models learned over a combination of continuous and discrete variables. MGMs consist of a parameterized joint probability density, which implies a network structure over these heterogeneous variables. The network structure reveals direct associations between the variables, and the joint probability density allows one to ask arbitrary probabilistic questions on the data. This information can be used for feature selection, classification and other important tasks. Integrating biomedical datasets from different data streams (e.g., omics, clinical) and of different types (continuous, discrete) is of utmost importance and has become an analysis bottleneck in biomedical research. One would like to be able to uncover all direct associations between variables and/or perform feature selection and classification tasks using all data. Although numerous biomedical data sets exist with genomic, transcriptomic and epigenetic data for each sample, a general framework for integrative analysis of these heterogeneous data is lacking.
