Dynamic Bayesian Network Learning to Infer Sparse Models From Time Series Gene Expression Data.

Hamda B Ajmal, Michael G Madden

doi:10.1109/tcbb.2021.3092879

Abstract

One of the key challenges in systems biology is to derive gene regulatory networks (GRNs) from complex, high-dimensional, sparse data. Bayesian networks (BNs) and dynamic Bayesian networks (DBNs) have been widely applied to infer GRNs from gene expression data. GRNs are typically sparse, but traditional approaches to BN structure learning for elucidating GRNs often produce many spurious (false positive) edges. We present two new BN scoring functions, which are extensions of the Bayesian Information Criterion (BIC) score with additional penalty terms, and use them in conjunction with DBN structure search methods to find a graph structure that maximises the proposed scores. Our BN scoring functions offer better solutions for inferring networks with fewer spurious edges compared to the BIC score. The proposed methods are evaluated extensively on autoregressive and DREAM4 benchmarks. We found that they significantly improve the precision of the learned graphs, relative to the BIC score. The proposed methods are also evaluated on three real time series gene expression datasets. The results demonstrate that our algorithms are able to learn sparse graphs from high-dimensional time series data. The implementation of these algorithms is open source and is available in the form of an R package on GitHub at https://github.com/HamdaBinteAjmal/DBN4GRN, along with the documentation and tutorials.

Highlights

With the advent of advanced technologies for genome sequencing and rapid reductions in the cost of sequencing, the last couple of decades have seen an explosion of large, complex datasets generated from biological experiments.
Where the performance using our score is significantly better than the performance using the Bayesian Information Criterion (BIC) score, the result is shown in bold.
The BIC score is too liberal for model selection when the model space is large, as its penalty term does not adapt to the dimensionality of the data [43].

Summary

Introduction

With the advent of advanced technologies for genome sequencing and rapid reductions in the cost of sequencing, the last couple of decades have seen an explosion of large, complex datasets generated from biological experiments. This explosion creates challenges for current data analysis methodologies. Delgado and Gomez-Vela [10] provide a comprehensive review of computational methods to reconstruct GRNs from data. These include information theory models [11], ordinary differential equation (ODE) models [12], neural networks [13], Boolean models [14], regression-based methods [15] and BNs [16]–[21]. In a GRN, the value of each component (variable/gene) is directly dependent on the values of a relatively small number of other components within its Markov blanket [16].
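To make the idea of scoring a sparse parent set concrete, the following is a minimal sketch of the standard BIC score for one gene in a first-order DBN. It is not the scoring functions proposed in the paper and not the DBN4GRN API. It assumes a linear-Gaussian model in which a gene's value at time t is regressed on candidate parents at time t-1. The function names `bic_score` and `score_parents`, the simulated data and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import numpy as np


def bic_score(y, X):
    """Standard BIC (higher is better) for a linear-Gaussian regression
    of y on the columns of X plus an intercept. Illustrative only."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # intercept + candidate parents
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = float(resid @ resid) / n  # maximum-likelihood noise variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = design.shape[1] + 1  # regression coefficients + noise variance
    return log_lik - 0.5 * k * math.log(n)


# Simulated time series (assumption): gene 0 at time t depends only on
# gene 1 at time t-1; the remaining genes are independent noise.
rng = np.random.default_rng(0)
n_time, n_genes = 200, 5
data = rng.normal(size=(n_time, n_genes))
for t in range(1, n_time):
    data[t, 0] = 0.8 * data[t - 1, 1] + 0.3 * rng.normal()

lagged = data[:-1]    # values at time t-1 (candidate parents)
target = data[1:, 0]  # gene 0 at time t


def score_parents(parents):
    """BIC of gene 0 given a list of parent gene indices at the previous time step."""
    cols = np.asarray(parents, dtype=int)
    return bic_score(target, lagged[:, cols])


score_no_parent = score_parents([])
score_true_parent = score_parents([1])
score_extra_parent = score_parents([1, 2])  # true parent plus one spurious parent

print("no parents:     ", score_no_parent)
print("true parent:    ", score_true_parent)
print("true + spurious:", score_extra_parent)
```

In this sketch, each additional parent costs a fixed 0.5 · log(n) in the penalty term, however many candidate genes there are. That is the sense in which the BIC penalty does not adapt to the dimensionality of the data.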

Methods

Results

Conclusion