Abstract

Background

Read alignment and transcript assembly are the core of RNA-seq analysis for transcript isoform discovery. Nonetheless, current tools are not designed to scale to the analysis of full-length bulk or single-cell RNA-seq (scRNA-seq) data. The previous version of our cloud-based tool, Falco, focuses only on RNA-seq read counting and does not support more flexible steps such as alignment and transcript assembly.

Results

The Falco framework can harness the parallel and distributed computing environment of modern cloud platforms to accelerate read alignment and transcript assembly of full-length bulk RNA-seq and scRNA-seq data. There are two new modes in Falco: alignment-only and transcript assembly. In the alignment-only mode, Falco speeds up the alignment process by 2.5–16.4x on two public scRNA-seq datasets when compared to alignment on a highly optimised standalone computer. It also provides a 10x average speed-up compared to alignment using Rail-RNA, a published cloud-enabled tool for read alignment. In the transcript assembly mode, Falco speeds up the transcript assembly process by 1.7–16.5x compared to performing transcript assembly on a highly optimised computer.

Conclusion

Falco is a significantly updated open source big data processing framework that enables scalable and accelerated alignment and assembly of full-length scRNA-seq data on the cloud. The source code can be found at https://github.com/VCCRI/Falco.

Highlights

  • Read alignment and transcript assembly are the core of Ribonucleic acid (RNA) sequencing (RNA-seq) analysis for transcript isoform discovery

  • We describe the development of the Falco framework which incorporates two additional modes of analysis: (1) alignment-only mode, where the output is an alignment file for each sample, and (2) transcript assembly mode, where the output is a reconstructed transcript isoform annotation based on the data

  • The updated Falco framework extends the capability of the original Falco framework by adding two processing modes: alignment-only and transcript assembly


Summary

Results

Evaluation of Falco alignment-only mode

One of the features of the read-quantification mode in the initial version of the Falco framework is the production of the gene expression matrix that is identical to that produced in a sequential analysis, where reads are not split into smaller chunks.

The serialisation issue, on the other hand, was encountered when performing transcript assembly using StringTie on the two largest bins in the mouse dataset, containing around 30 million reads each (Additional file 4), and is a result of the limitation of the PySpark framework used for developing the Falco framework. In both PySpark and Spark frameworks, the input [...]

This last approach will require some manual intervention from the user as they need to merge the alignment results and perform transcript assembly.
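The claim that chunked processing can reproduce a sequential gene expression matrix can be illustrated with a small sketch. This is not Falco's implementation: the function names (`split_records`, `count_chunk`, `merge_counts`) and the toy `assign_gene` rule are hypothetical, and the sketch assumes each read is assigned to a gene independently of all other reads, so that per-chunk counts are additive.

```python
from collections import Counter
from itertools import islice


def split_records(records, chunk_size):
    """Yield lists of at most chunk_size reads (hypothetical chunking step)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_chunk(chunk, assign_gene):
    """Count reads per gene in one chunk.

    assign_gene is a hypothetical stand-in for aligning and quantifying
    a single read; it must not depend on any other read.
    """
    return Counter(assign_gene(read) for read in chunk)


def merge_counts(partial_counts):
    """Sum per-chunk counts into one table."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Toy data: (read_id, sequence) pairs; the gene is chosen by the first base.
records = [("r%d" % i, seq) for i, seq in enumerate(
    ["ACGT", "AGGT", "CCGT", "TTGA", "ACCA", "CGTA", "TGCA"])]


def assign_gene(read):
    return "gene_" + read[1][0]


# Sequential analysis: all reads in one pass.
sequential = count_chunk(records, assign_gene)

# Chunked analysis: split, count each chunk, then merge.
chunked = merge_counts(
    count_chunk(chunk, assign_gene) for chunk in split_records(records, 3)
)
```

Under the stated independence assumption, `sequential` and `chunked` hold the same counts regardless of the chunk size. Steps that look across reads, such as transcript assembly, do not have this property, which is consistent with the merging step described above.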
