Multi-omic data integration enables discovery of hidden biological regularities

Ali Ebrahim, Adam M Feist, Donghyuk Kim, Anand Sastry, Justin Tan, Bernhard O Palsson, Edward J O'Brien, Aarash Bordbar, Elizabeth Brunk, Richard Szubin, Joshua A Lerman, Anna Lechner

doi:10.1038/ncomms13091

Abstract

Rapid growth in size and complexity of biological data sets has led to the ‘Big Data to Knowledge’ challenge. We develop advanced data integration methods for multi-level analysis of genomic, transcriptomic, ribosomal profiling, proteomic and fluxomic data. First, we show that pairwise integration of primary omics data reveals regularities that tie cellular processes together in Escherichia coli: the number of protein molecules made per mRNA transcript and the number of ribosomes required per translated protein molecule. Second, we show that genome-scale models, based on genomic and bibliomic data, enable quantitative synchronization of disparate data types. Integrating omics data with models enabled the discovery of two novel regularities: condition-invariant in vivo turnover rates of enzymes and the correlation of protein structural motifs and translational pausing. These regularities can be formally represented in a computable format allowing for coherent interpretation and prediction of fitness and selection that underlies cellular physiology.

Highlights

Rapid growth in size and complexity of biological data sets has led to the ‘Big Data to Knowledge’ challenge
Do sequence-specific motifs drive co-translational pausing to ensure proper protein folding? We find that Shine–Dalgarno (SD)-like sequences account for 20–22% of ribosome density at pause sites (Fig. 2c and see ‘Identification of SD-like codons’ in Methods), which is consistent with recent studies[25] and four times less frequent than what was found in previous studies[20]
The unprecedented growth in the type, size and complexity of biological data sets over the past couple of decades has led to a pressing grand challenge in biology referred to as Big Data to Knowledge (BD2K)

Summary

Introduction

Rapid growth in size and complexity of biological data sets has led to the ‘Big Data to Knowledge’ (BD2K) challenge. Progress of the biological sciences in the era of big data will depend on how we address the following question: ‘How do we connect multiple disparate data types[1] to obtain a meaningful understanding of the biological functions of an organism[2]?’ Owing to large-scale improvements in omics technologies, we can quantitatively track changes in biological processes in unprecedented detail[3,4]. Although such measurements span a diverse range of cellular activities, an understanding of how these data types quantitatively relate to one another and to the phenotypic characteristics of the organism remains elusive. The approach presented here directly addresses the BD2K grand challenge and is made conceptually accessible by tracing the ‘information flow’ through the familiar ‘central dogma’ to establish relationships between measurements and cell physiology (Fig. 1).

Methods

Results

Conclusion