MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

Andrea Hita,Gilles Brocart,Anna Alemany,Ana Fernandez,Sol Schvartzman,Marc Rehmsmeier

doi:10.1186/s12859-021-04544-3


Open Access

https://doi.org/10.1186/s12859-021-04544-3


Abstract

Background: Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. Yet, computational pipelines have traditionally focused on particular biotypes, making assumptions that are not fulfilled by total-RNA-seq datasets. Transcripts from distinct RNA biotypes vary in length, biogenesis, and function, can overlap in a genomic region, and may be present in the genome with a high copy number. Consequently, reads from total-RNA-seq libraries may produce ambiguous genomic alignments, which calls for flexible quantification approaches.

Results: Here we present Multi-Graph count (MGcount), a total-RNA-seq quantification tool combining two strategies for handling ambiguous alignments. First, MGcount assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position. Next, MGcount uses a graph-based approach to aggregate RNA products with similar sequences, to which reads systematically multi-map. MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, at both bulk and single-cell resolution, together with the graphs that model repeated transcript structures for different biotypes. The software can be used as a Python module or as a single-file executable program.

Conclusions: MGcount is a flexible total-RNA-seq quantification tool that successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features. Its approach is suitable for the simultaneous estimation of protein-coding, long non-coding and small non-coding transcript concentration, in both precursor and processed forms. Both source code and compiled software are available at https://github.com/hitaandrea/MGcount.
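The two strategies named in the abstract can be illustrated with a minimal Python sketch. This is not MGcount's code: the function names, the feature names, the priority order (small-RNA before long-RNA) and the use of connected components as the grouping step are all assumptions made for illustration, and the tool itself may group features differently.

```python
from collections import defaultdict
from itertools import combinations


def assign_hierarchically(overlaps, priority=("small", "long")):
    """Return the features of the highest-priority class that an alignment overlaps.

    overlaps: list of (feature_name, feature_class) pairs for one alignment.
    The priority order is an assumption, not taken from the tool.
    """
    for cls in priority:
        hits = [name for name, c in overlaps if c == cls]
        if hits:
            return hits
    return []


def multimap_graph(reads):
    """Build edge weights: number of reads shared by each pair of features."""
    weights = defaultdict(int)
    for feats in reads:
        for a, b in combinations(sorted(set(feats)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def group_features(nodes, weights, min_shared=1):
    """Group features linked by at least `min_shared` shared reads.

    Connected components (via union-find) stand in for the grouping step.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: each read lists the features it was assigned to.
reads = [["mir-a1", "mir-a2"], ["mir-a1", "mir-a2"], ["sno-b"], ["mir-a2", "mir-a3"]]
nodes = {f for r in reads for f in r}
weights = multimap_graph(reads)

print(assign_hierarchically([("mir-x", "small"), ("host-gene", "long")]))
print(group_features(nodes, weights))
print(group_features(nodes, weights, min_shared=2))
```

Raising `min_shared` in this sketch splits off features that share only occasional reads, which is one simple way to separate systematic multi-mapping from incidental multi-mapping.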

Highlights

Total-RNA sequencing allows the simultaneous study of both the coding and the non-coding transcriptome
Hierarchical assignment resolves small-RNA/long-RNA multi-overlappers: to assess the potential impact of overlapping features from different biotypes on RNA-seq analysis, we explored their overlap frequencies
As a reference, we used the customized gene transfer format (GTF) file that integrates several databases for the following species: H. sapiens, M. musculus, C. elegans and A. thaliana (Additional file 1, a-d)
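As a rough illustration of what counting overlaps between biotypes can look like, the sketch below tallies pairs of overlapping features that carry different biotypes. It assumes features have already been parsed from a GTF file into tuples with 1-based inclusive coordinates, and that overlaps are counted per strand. The function name, the example records and the strand-aware choice are assumptions, not details taken from the paper.

```python
from collections import Counter


def biotype_overlap_counts(features):
    """Count overlapping feature pairs per pair of distinct biotypes.

    features: iterable of (chrom, start, end, strand, biotype),
    with 1-based inclusive coordinates as in GTF.
    """
    by_locus = {}
    for chrom, start, end, strand, biotype in features:
        by_locus.setdefault((chrom, strand), []).append((start, end, biotype))

    counts = Counter()
    for items in by_locus.values():
        items.sort()  # sort by start coordinate
        for i, (s1, e1, b1) in enumerate(items):
            for s2, e2, b2 in items[i + 1:]:
                if s2 > e1:
                    break  # later features start even further right
                if b1 != b2:
                    counts[tuple(sorted((b1, b2)))] += 1
    return counts


# Hypothetical records, not taken from any annotation database.
features = [
    ("chr1", 100, 5000, "+", "protein_coding"),
    ("chr1", 1200, 1280, "+", "miRNA"),
    ("chr1", 4000, 4100, "+", "snoRNA"),
    ("chr1", 6000, 6070, "+", "miRNA"),
    ("chr1", 1200, 1280, "-", "lncRNA"),
]
overlap_counts = biotype_overlap_counts(features)
for pair, n in sorted(overlap_counts.items()):
    print(pair, n)
```

The nested loop is quadratic in the worst case; for a genome-scale annotation an interval tree or a sweep-line approach would be the more usual choice.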

Summary

Introduction

Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. While early NGS experiments focused on the detection of polyadenylated RNA (i.e., messenger RNA [mRNA] and polyadenylated long non-coding RNA [lncRNA]), later RNA library preparation methods made it possible to target small regulatory RNAs (small RNAs) [7,8,9] and full transcriptomes (hereafter referred to as total-RNA-seq). Total-RNA-seq simultaneously captures polyadenylated and non-polyadenylated RNA, which together include all types of mRNA, lncRNA, and small RNA, both as precursors and in processed forms. With total-RNA-seq library preparation methods having recently reached single-cell resolution [10,11,12,13,14], it has become possible to investigate transcriptional regulation through non-coding RNA in unprecedented detail. Total-RNA-seq analysis needs to integrate a […]

Results

Discussion

Conclusion