A Bayesian Framework for the Classification of Microbial Gene Activity States.

Craig Disselkoen, Kristin Koch, Reginald Lerebours, Kaitlyn Cook, Karen Fischer, Mark Cunningham, Matthew Dejongh, Joshua Cape, Yonatan Ashenafi, Chase Viss, Elizabeth Held, Aaron A Best, Brian Greco, Nathan Tintle, Allyson Acosta

doi:10.3389/fmicb.2016.01191

Abstract

Numerous methods for classifying gene activity states based on gene expression data have been proposed for use in downstream applications, such as incorporating transcriptomics data into metabolic models in order to improve resulting flux predictions. These methods often attempt to classify gene activity for each gene in each experimental condition as belonging to one of two states: active (the gene product is part of an active cellular mechanism) or inactive (the cellular mechanism is not active). These existing methods of classifying gene activity states suffer from multiple limitations, including enforcing unrealistic constraints on the overall proportions of active and inactive genes, failing to leverage a priori knowledge of gene co-regulation, failing to account for differences between genes, and failing to provide statistically meaningful confidence estimates. We propose a flexible Bayesian approach to classifying gene activity states based on a Gaussian mixture model. The model integrates genome-wide transcriptomics data from multiple conditions and information about gene co-regulation to provide activity state confidence estimates for each gene in each condition. We compare the performance of our novel method to that of existing methods on simulated data, on real data from 907 E. coli gene expression arrays, and against experimentally measured flux values in 29 conditions, demonstrating that our method provides more consistent and accurate results than existing methods across a variety of metrics.

Highlights

Numerous approaches to understanding and utilizing gene expression measurements attempt to classify them into one of two states: active or inactive (Ferrell, 2002; Abel et al, 2013; Gallo et al, 2015)
Our goal is to provide guidance to researchers who regularly put this intuition into practice, by assessing their methods for classifying genes into activity states based on gene expression data, and proposing statistical models for data analysis that lead to improved classifications
The Multivariate Mixture Model (MultiMM) method performed best compared to other methods when examining only the subset of genes identified as coming from a single component

Summary

Introduction

Numerous approaches to understanding and utilizing gene expression measurements attempt to classify them into one of two states: active (roughly speaking, the gene product is part of an active cellular mechanism) or inactive (the cellular mechanism is not active) (Ferrell, 2002; Abel et al, 2013; Gallo et al, 2015). Recent approaches to metabolic modeling (MM) have focused on the integration of multiple sources of genetic information including transcriptomics data (Pfau et al, 2011; Lewis et al, 2012; Bordbar et al, 2014; Chubukov et al, 2014; Machado and Herrgård, 2014; Monk et al, 2014; Rezola et al, 2014). In these approaches, gene activity states are usually incorporated into constraints on the fluxes through reactions associated with the gene products. While Probabilistic Regulation of Metabolism (PROM) classifies gene states in the standard manner, it does not directly constrain flux balance analysis (FBA) based on those states.
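The two-state classification described above can be illustrated with a minimal sketch. It is not the Bayesian procedure proposed in this paper: it assumes a plain two-component univariate Gaussian mixture fitted by expectation-maximization to the expression values of a single gene across conditions, and it ignores co-regulation entirely. The function names, the simulated data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def _responsibilities(x, weights, means, sds):
    """Return per-observation component probabilities and the log-likelihood."""
    z = (x[:, None] - means) / sds
    dens = weights * np.exp(-0.5 * z ** 2) / (sds * np.sqrt(2.0 * np.pi))
    # Floor the mixture density to avoid division by zero for extreme values.
    total = np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
    return dens / total, np.log(total).sum()


def fit_two_state_mixture(x, n_iter=500, tol=1e-9):
    """Fit a two-component Gaussian mixture by EM (illustrative only).

    Returns (weights, means, sds, p_active), where component 0 is the
    lower-mean ("inactive") state, component 1 the higher-mean ("active")
    state, and p_active[i] is the estimated probability that observation i
    belongs to the active state.
    """
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: quartiles as starting means, pooled spread.
    means = np.percentile(x, [25, 75]).astype(float)
    sds = np.full(2, x.std() + 1e-6)
    weights = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: probability of each state for each condition.
        resp, ll = _responsibilities(x, weights, means, sds)
        # M-step: re-estimate mixing weights, means and standard deviations.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        sds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk) + 1e-6
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    # Order components so that index 1 is always the higher-mean state.
    order = np.argsort(means)
    weights, means, sds = weights[order], means[order], sds[order]
    resp, _ = _responsibilities(x, weights, means, sds)
    return weights, means, sds, resp[:, 1]


# Simulated log-expression for one gene: 60 inactive and 40 active conditions.
rng = np.random.default_rng(0)
expr = np.concatenate([rng.normal(4.0, 0.5, 60), rng.normal(9.0, 0.7, 40)])
weights, means, sds, p_active = fit_two_state_mixture(expr)
print("estimated state means:", means)
print("P(active) for first and last condition:", p_active[0], p_active[-1])
```

Unlike a hard cutoff, the `p_active` values give a per-condition confidence rather than a binary call. A gene whose expression appears to come from a single component would need separate handling, which this sketch omits.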

Objectives

Methods

Results

Discussion

Conclusion