Reproducibility is a fundamental expectation in science and enables investigators to have confidence in their research findings and the ability to compare data from disparate sources, but evaluating reproducibility can be elusive. For example, generating RNA sequencing (RNA-seq) data includes multiple steps where variance can be introduced. Thus, it is unclear if RNA-seq data from different sources can be validly compared. While most studies on RNA-seq reproducibility focus on eukaryotes, we evaluate bias in bacteria using Pseudomonas aeruginosa gene expression data from five laboratory models of cystic fibrosis. We leverage a large data set that includes samples prepared in three different laboratories and paired data sets where the same sample was sequenced using at least two different sequencing pipelines. We report here that expression data are highly reproducible across laboratories. In addition, while samples sequenced with different sequencing pipelines showed significantly more variance in expression profiles than samples prepared in different laboratories, gene expression was still highly reproducible between sequencing pipelines. Further investigation of expression differences between two sequencing pipelines revealed that library preparation methods were the largest source of error, though analyses to identify the source of this variance were inconclusive. Consistent with the reproducibility of expression between sequencing pipelines, we found that different pipelines detected over 80% of the same differentially expressed genes with large expression differences between conditions. Thus, bacterial RNA-seq data from different sources can be validly compared, facilitating the ability to advance understanding of bacterial behavior and physiology using the wide array of publicly available RNA-seq data sets.

IMPORTANCE
RNA sequencing (RNA-seq) has revolutionized biology, but many steps in RNA-seq workflows can introduce variance, potentially compromising reproducibility. While reproducibility in RNA-seq has been thoroughly investigated in eukaryotes, less is known about pipelines and workflows that introduce variance and biases in bacterial RNA-seq data. By leveraging Pseudomonas aeruginosa transcriptomes in cystic fibrosis models from different laboratories and sequenced with different sequencing pipelines, we directly assess sources of bacterial RNA-seq variance. RNA-seq data were highly reproducible, with the largest variance due to sequencing pipelines, specifically library preparation. Different sequencing pipelines detected overlapping differentially expressed genes, especially those with large expression differences between conditions. This study confirms that different approaches to preparing and sequencing bacterial RNA libraries capture comparable transcriptional profiles, supporting investigators' ability to leverage diverse RNA-seq data sets to advance their science.