Abstract

The availability of bacterial transcriptomes has dramatically increased in recent years. This data deluge could result in detailed inference of underlying regulatory networks, but the diversity of experimental platforms and protocols introduces critical biases that could hinder scalable analysis of existing data. Here, we show that the underlying structure of the E. coli transcriptome, as determined by Independent Component Analysis (ICA), is conserved across multiple independent datasets, including both RNA-seq and microarray datasets. We subsequently combined five transcriptomics datasets into a large compendium containing over 800 expression profiles and discovered that its underlying ICA-based structure was still comparable to that of the individual datasets. With this understanding, we expanded our analysis to over 3,000 E. coli expression profiles and predicted three high-impact regulons that respond to oxidative stress, anaerobiosis, and antibiotic treatment. ICA thus enables deep analysis of disparate data to uncover new insights that were not visible in the individual datasets.

Highlights

  • Public databases, such as the NCBI Gene Expression Omnibus (GEO) [1] and ArrayExpress [2], contain thousands of transcriptomics datasets that are often designed and analyzed for a specific study

  • We showed that independent component analysis (ICA), a signal deconvolution algorithm, can separate a large bacterial gene expression dataset into groups of co-regulated genes

  • We show that ICA finds similar co-regulation patterns underlying multiple gene expression datasets and can be used as a tool to integrate and interpret diverse datasets
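The decomposition described in the highlights can be illustrated with a minimal sketch. It assumes scikit-learn's FastICA as a stand-in, since the text here does not say which ICA implementation was used, and it runs on a synthetic genes-by-samples matrix rather than real expression profiles. The function names (make_toy_expression, decompose, top_genes) and all sizes are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA


def make_toy_expression(n_genes=500, n_samples=40, group_size=20, n_groups=3, seed=0):
    """Build a synthetic genes x samples matrix with known co-regulated groups."""
    rng = np.random.default_rng(seed)
    # Sparse gene weights: each group of genes responds to exactly one hidden signal.
    S = np.zeros((n_genes, n_groups))
    for k in range(n_groups):
        rows = slice(k * group_size, (k + 1) * group_size)
        S[rows, k] = 5.0 * rng.choice([-1.0, 1.0], size=group_size)
    # Activity of each hidden signal in each sample (condition).
    A = rng.normal(size=(n_groups, n_samples))
    noise = 0.1 * rng.normal(size=(n_genes, n_samples))
    return S @ A + noise


def decompose(X, n_components, seed=0):
    """Factor X (genes x samples) into gene weights S_hat and activities A_hat."""
    # Assumption: FastICA stands in for whichever ICA variant the study used.
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    S_hat = ica.fit_transform(X)  # genes x components
    A_hat = ica.mixing_.T         # components x samples
    return S_hat, A_hat


def top_genes(S_hat, component, n_top):
    """Indices of the n_top genes with the largest absolute weight in one component."""
    order = np.argsort(-np.abs(S_hat[:, component]))
    return set(order[:n_top].tolist())
```

Calling `decompose(make_toy_expression(), 3)` returns one column of gene weights per component, and `top_genes(S_hat, 0, 20)` lists the genes that dominate the first component. Genes are treated as observations here so that each independent component is a sparse set of gene weights; the order and sign of the components are arbitrary, as is usual for ICA.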



Introduction

Public databases, such as the NCBI Gene Expression Omnibus (GEO) [1] and ArrayExpress [2], contain thousands of transcriptomics datasets that are often designed and analyzed for a specific study. Microarrays were the platform of choice for transcriptomic interrogation, resulting in large, publicly available datasets containing thousands of expression profiles for a variety of organisms [3,4]. Multiple consortia have performed extensive comparisons of expression levels across different microarray and RNA-seq platforms [6,7,8]. These studies showed that neither expression profiling technique can accurately measure absolute gene expression levels, whereas relative abundances are consistent across a wide range of transcriptomics platforms when appropriate quality controls are applied. Batch effects and technical heterogeneity continue to present significant challenges to the successful integration of omics datasets [9].

