Reproducible RNA-seq data processing and analysis tools in the cloud.

William L Poehlman, Tess Thyer, Jake Gockley, Anna K Greenwood, Larsson Omberg, Solveig K Sieberts, Kelsey S Montgomery, Nicole Kauer, Mette A Peters, Kara Woo, James A Eddy, Lara M Mangravite

doi:10.1002/alz.056527

Abstract

Discovering the heterogeneous biological processes underlying Alzheimer's disease (AD) is key to prioritizing potential drug candidates. Analysis of RNA sequencing (RNA-seq) data can improve our understanding of these processes by revealing gene expression patterns associated with AD. Analyzing these data requires processing large volumes of sequencing files, followed by downstream analysis in a secure compute environment. To ensure reliable results, software must be executed consistently so that results can be reproduced in different environments. To help address these challenges, we have developed reproducible bioinformatic tools for raw data processing and analysis in the Amazon Web Services (AWS) cloud compute environment. We have implemented an RNA-seq processing pipeline in the Common Workflow Language (CWL). Raw sequencing reads in FASTQ or BAM format are aligned to the reference genome using the STAR read aligner, and gene counts are quantified (https://github.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq). In addition, we have developed an R package for gene count normalization (https://github.com/Sage-Bionetworks/sageseqr). To enable execution of these tools, we provide an analytical workspace in a secure AWS compute environment (https://adknowledgeportal.synapse.org/Analytical%20Workspace). We have used these tools to reprocess data from several RNA-seq studies, which are available through the AD Knowledge Portal (adknowledgeportal.org) as the RNAseq Harmonization Study. As new datasets are generated, they can be processed in a consistent software environment to enable cross-study analysis. Because the data processing is reproducible, users can perform similar RNA-seq experiments without needing to implement new pipelines. The development of reproducible RNA-seq processing and analysis tools provides a valuable resource for the AD research community.
While we have demonstrated execution of these tools in the cloud, they may also be executed in other environments, such as high-performance computing clusters. Our tools will remain stable resources for reproducible processing of RNA-seq datasets as infrastructure evolves.
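
To illustrate what gene count normalization involves, here is a minimal sketch of counts-per-million (CPM) scaling in Python. It is an assumption chosen for illustration only, not the method implemented in sageseqr, whose normalization steps are not described in this abstract. The function name `counts_per_million` and the toy count table are hypothetical.

```python
from typing import Dict, List


def counts_per_million(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Scale raw gene counts to counts per million (CPM) within each sample.

    `counts` maps a gene ID to its raw counts, one value per sample, with
    samples in the same order for every gene. Illustrative only; this is
    not the sageseqr implementation.
    """
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    # Library size: total number of counted reads in each sample.
    lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    return {
        gene: [
            row[i] * 1e6 / lib_sizes[i] if lib_sizes[i] else 0.0
            for i in range(n_samples)
        ]
        for gene, row in counts.items()
    }


# Hypothetical toy table: three genes, two samples of different depth.
raw = {"geneA": [10, 0], "geneB": [30, 50], "geneC": [60, 150]}
cpm = counts_per_million(raw)
```

Dividing by each sample's library size makes samples sequenced at different depths comparable, which is one prerequisite for the kind of cross-study analysis described above.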
