Abstract

BackgroundLong-read RNA-Seq techniques can generate reads that encompass a large proportion or the entire mRNA/cDNA molecules, so they are expected to address inherited limitations of short-read RNA-Seq techniques that typically generate < 150 bp reads. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data, which takes into account the high basecalling error rates and the presence of alignment errors.ResultsIn this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets, and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF achieved better performance to detect known gene fusions over existing computational tools. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset on acute myeloid leukemia, and pinpointed the exact location of a translocation (previously known in cytogenetic resolution) in base resolution, which was further validated by Sanger sequencing.ConclusionsIn summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-Seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.

Highlights

  • Long-read RNA-Seq techniques can generate reads that encompass a large proportion or the entire mRNA/cDNA molecules, so they are expected to address inherited limitations of short-read RNA-Seq techniques that typically generate < 150 bp reads

  • Our evaluation demonstrated that Long-read gene fusion detector (LongGF) successfully detected candidate gene fusions from long-read RNAseq data, and some of these fusions are previously known or can be validated by additional Sanger sequencing

  • Please note that we do not consider these pseudogenes in the analysis by default, and in LongGF, users can specify whether to use pseudogenes in gene fusion detection; and (2) a long read sequence whose mapped bases of two alignment records do not have an appropriate gap: for example in a long read sequence, the mapped bases from M1 to M2 are used in one alignment record, and bases from M3 to M4 are used in another alignment record, M3 > M1, if M3 − M2 is larger than a threshold (such as 20, as shown in Fig. 1 (e)) or less than − 20, as shown in Fig. 1 (d), the two alignment records do not have an appropriate gap

Read more

Summary

Introduction

There is a general lack of software tools for gene fusion detection from long-read RNA-seq data, which takes into account the high basecalling error rates and the presence of alignment errors. Gene fusion plays a critical role in transcriptome diversity and may be associated with human diseases, especially cancer. Gene fusions can be used as biomarkers for cancer diagnosis, such as in breast cancer [16] and ovarian cancer [17], and used as therapeutic targets for cancer [18,19,20,21]. The ability to target and better understand gene fusions may lead to the development of novel targeted therapies in the future

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.