A Bayesian network approach incorporating imputation of missing data enables exploratory analysis of complex causal biological relationships.

Richard Howey, Arthur G Pratt, Louise N Reynard, Heather J Cordell, Najib Naamane, Alexander D Clark, Marco Scutari

doi:10.1371/journal.pgen.1009811

Open Access

Abstract

Bayesian networks can be used to identify possible causal relationships between variables based on their conditional dependencies and independencies, which can be particularly useful in complex biological scenarios with many measured variables. Here we propose two improvements to an existing method for Bayesian network analysis, designed to increase the power to detect potential causal relationships between variables (including, potentially, a mixture of discrete and continuous variables). Our first improvement relates to the treatment of missing data. When data are missing, the standard approach is to remove every individual with any missing data before performing the analysis. This can be wasteful and undesirable when there are many individuals with missing data, perhaps with only one or a few variables missing, which motivates the use of imputation. We present a new imputation method that uses a version of nearest neighbour imputation, whereby missing data from one individual are replaced with data from another individual, their nearest neighbour. For each individual with missing data, the subsets of variables used to select the nearest neighbour are chosen by bootstrapping the complete data (that is, resampling it with replacement) and estimating a best fit Bayesian network. We show that this approach leads to marked improvements in the recall and precision of directed edges in the final network identified, and we illustrate the approach through application to data from a recent study investigating the causal relationship between methylation and gene expression in early inflammatory arthritis patients. We also describe a second improvement in the form of a pseudo-Bayesian approach for upweighting certain network edges, which can be useful when there is prior evidence concerning their directions.
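The core idea of nearest neighbour imputation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `nn_impute`, the `match_cols` parameter, and the use of standardised Euclidean distance are all hypothetical choices made for this example. In particular, the abstract's step of choosing the matching variables from a Bayesian network fitted to bootstrapped data is not reproduced; `match_cols` merely stands in for whatever subset that step would select.

```python
import numpy as np


def nn_impute(data, match_cols=None):
    """Fill NaN entries of each row from its nearest fully observed row.

    data: 2-D array, rows are individuals and columns are variables.
    match_cols: optional column indices to measure distance on
        (hypothetical stand-in for a network-derived variable subset).
    """
    data = np.asarray(data, dtype=float)
    out = data.copy()

    # Donors are the individuals with no missing values at all.
    complete = data[~np.isnan(data).any(axis=1)]
    if complete.shape[0] == 0:
        raise ValueError("need at least one complete individual")

    # Scale each variable by its standard deviation among the donors
    # so that no single variable dominates the distance (an assumption).
    sd = complete.std(axis=0)
    sd[sd == 0] = 1.0

    for i in range(data.shape[0]):
        missing = np.isnan(data[i])
        if not missing.any():
            continue
        observed = ~missing
        if match_cols is not None:
            keep = np.zeros_like(observed)
            keep[list(match_cols)] = True
            observed &= keep
        if not observed.any():
            continue  # nothing to match on, so leave the row as it is

        # Squared standardised distance to every donor on the matching columns.
        diff = (complete[:, observed] - data[i, observed]) / sd[observed]
        nearest = np.argmin((diff ** 2).sum(axis=1))

        # Copy only the missing entries from the nearest donor.
        out[i, missing] = complete[nearest, missing]
    return out
```

For example, calling `nn_impute` on a small array in which one individual lacks its second variable copies that value from whichever complete individual is closest on the remaining variables, so that individual is retained in the analysis rather than discarded.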

Highlights

Genome-wide association studies (GWAS) have had considerable success in detecting genetic variants (typically single nucleotide polymorphisms, SNPs) associated with phenotypic outcomes.
We show through computer simulations that our method improves the reliability of the results obtained, and we illustrate the proposed approach by applying it to data from a recent study in early inflammatory arthritis.

Summary

Introduction

Genome-wide association studies (GWAS) have had considerable success in detecting genetic variants (typically single nucleotide polymorphisms, SNPs) associated with phenotypic outcomes. One possible analysis approach is to use Bayesian networks (BNs), whereby potential causal relationships between many different genetic, biological and phenotypic variables may be explored, taking advantage of the fact that genetic variables can act as instruments to help orient the directions of relationships between other variables. This approach has been shown to perform competitively with other causal inference methods [1] and in some cases even to have advantages over competing approaches such as Mendelian Randomisation (MR) and its extensions [2]. The local probability distributions of a BN can be defined in many ways, but a popular approach, which we use throughout, is for discrete variables to take a multinomial distribution and continuous variables to take a multivariate normal distribution.
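To make the phrase "local probability distributions" concrete, the standard Bayesian network factorisation can be written as below. The notation is illustrative and not taken from the paper: pa(X_i) denotes the parents of node X_i in the network, and the second line shows one common way a continuous node with continuous parents is modelled, which is the form that a joint multivariate normal distribution implies for each node given its parents.

```latex
% Joint distribution as a product of local distributions, one per node
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)

% Illustrative local distribution for a continuous node with continuous parents
X_i \mid \mathrm{pa}(X_i) = \mathbf{x}
  \;\sim\; \mathcal{N}\bigl(\beta_{i0} + \boldsymbol{\beta}_i^{\top}\mathbf{x},\; \sigma_i^{2}\bigr)
```

For a discrete node, the corresponding local distribution is a multinomial whose probabilities depend on the configuration of its parents.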

Methods

Results

Discussion

Conclusion