BISR-RNAseq: an efficient and scalable RNAseq analysis workflow with interactive report generation

Venkat Sundar Gadepalli, Amy Webb, Hatice Gulcin Ozer, Ayse Selen Yilmaz, Maciej Pietrzak

doi:10.1186/s12859-019-3251-1

Open Access

Abstract

Background: RNA sequencing has become an increasingly affordable way to profile gene expression patterns. Here we introduce a workflow that combines several open-source software tools and can be run in a high-performance computing environment.

Results: The workflow was developed by the Bioinformatics Shared Resource Group (BISR) at the Ohio State University. To demonstrate its feasibility, we applied the pipeline to a few publicly available RNAseq datasets downloaded from GEO. Source code is available for the workflow at https://code.bmi.osumc.edu/gadepalli.3/BISR-RNAseq-ICIBM2019 and for the Shiny application at https://code.bmi.osumc.edu/gadepalli.3/BISR_RNASeq_ICIBM19. An example dataset is demonstrated at https://dataportal.bmi.osumc.edu/RNA_Seq/.

Conclusion: The workflow supports the analysis of raw RNAseq data (alignment, QC, and generation of gene-wise counts) and integrates the quality analysis and differential expression results into a configurable R Shiny web application.

Highlights

Ribonucleic acid (RNA) sequencing has become an increasingly affordable way to profile gene expression patterns
As the Bioinformatics Shared Resource (BISR) group at Ohio State University (OSU), we developed this workflow to provide consistent analysis and reports to our collaborators
If a sample is run on multiple lanes, we recommend keeping the lanes separate so that quality control (QC) can be assessed on each lane individually
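
The per-lane recommendation above can be illustrated with a minimal Python sketch that groups FASTQ files into one QC unit per sample and lane instead of concatenating lanes up front. It is not taken from the BISR-RNAseq source. The file names are hypothetical, and the pattern assumes the common Illumina-style naming scheme `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`; the input layout the workflow actually expects is not described in this section.

```python
import re
from collections import defaultdict

# Assumption: Illumina-style names, e.g. tumorA_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq(\.gz)?$"
)


def group_by_sample_and_lane(filenames):
    """Map (sample, lane) -> {"R1": file, "R2": file}, one QC unit per lane."""
    units = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # ignore files that do not follow the assumed pattern
        key = (match["sample"], match["lane"])
        units[key]["R" + match["read"]] = name
    return dict(units)


# Hypothetical input: one sample sequenced on two lanes, paired-end.
files = [
    "tumorA_S1_L001_R1_001.fastq.gz",
    "tumorA_S1_L001_R2_001.fastq.gz",
    "tumorA_S1_L002_R1_001.fastq.gz",
    "tumorA_S1_L002_R2_001.fastq.gz",
]

for (sample, lane), reads in sorted(group_by_sample_and_lane(files).items()):
    print(sample, lane, reads["R1"], reads["R2"])
```

Each `(sample, lane)` entry can then be passed to a QC tool on its own, so a problem confined to a single lane stays visible rather than being averaged into the sample.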

Summary

Introduction

RNA sequencing has become an increasingly affordable way to profile gene expression patterns. We introduce a workflow that combines several open-source software tools and can be run in a high-performance computing environment. Whole-transcriptome sequencing provides an estimate of the abundance of all transcripts present in a group of cells. High-throughput sequencing technologies have been developed to sequence the transcriptome deeply. Sequencing generates several million short reads that are typically 50–400 bases in length. These reads can be mapped to a known reference genome or assembled de novo. Either method provides a snapshot of the transcripts present in the sample and an estimate of their abundance. Statistical methods have been developed to normalize and compare transcript estimates in order to identify differentially expressed transcripts. At each step of the bioinformatics analysis pipeline, there are many options for specific programs to use, reference

Objectives

Results

Discussion

Conclusion