Abstract

We present SeqOthello, an ultra-fast and memory-efficient indexing structure that supports arbitrary sequence queries against large collections of RNA-seq experiments. SeqOthello needs only 5 min and 19.1 GB of memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of the tier-1 fusions curated by the TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs.

Highlights

  • Advances in the study of functional genomics over the past decade have produced a vast resource of RNA-seq datasets

  • SeqOthello data structure: A sequencing experiment can be represented by a collection of k-mers, or length-k subsequences of the original reads. k-mers are fundamental components of de Bruijn graphs and are essential for de novo transcriptome assembly [19,20,21].

  • SeqOthello relies on a novel, hierarchical indexing structure to facilitate fast retrieval of k-mer occurrence maps (Fig. 1a)
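
To make the k-mer representation and the idea of a k-mer occurrence map concrete, here is a minimal Python sketch. It is not SeqOthello's hierarchical index: it assumes a plain in-memory dictionary, and the names `kmers`, `build_occurrence_map`, and `query` are hypothetical, chosen only for illustration. It also ignores reverse complements and any compression of the map.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_occurrence_map(experiments, k):
    """Map each k-mer to the set of experiment IDs whose reads contain it.

    `experiments` is a dict of experiment ID -> list of read strings.
    This is a toy stand-in for an occurrence map, not SeqOthello's structure.
    """
    occurrence = defaultdict(set)
    for exp_id, reads in experiments.items():
        for read in reads:
            for kmer in kmers(read, k):
                occurrence[kmer].add(exp_id)
    return occurrence


def query(sequence, occurrence, k):
    """Return, per experiment, the fraction of the query's k-mers it contains."""
    query_kmers = list(kmers(sequence, k))
    hits = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids creating empty entries for unseen k-mers
        for exp_id in occurrence.get(kmer, ()):
            hits[exp_id] += 1
    return {exp_id: n / len(query_kmers) for exp_id, n in hits.items()}


# Toy usage with made-up reads (illustrative data only).
experiments = {
    "exp_A": ["ACGTAC", "GTACGG"],
    "exp_B": ["TTTACG"],
}
occurrence = build_occurrence_map(experiments, k=4)
print(query("CGTACG", occurrence, k=4))
```

A dictionary of sets like this grows quickly with the number of distinct k-mers and experiments, which is the memory cost that a dedicated index is meant to avoid.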

Summary

Introduction

Advances in the study of functional genomics over the past decade have produced a vast resource of RNA-seq datasets. As of December 2017, over 12 petabytes of RNA-seq data were deposited in the Sequence Read Archive (SRA) [1]. Sequencing consortiums such as The Cancer Genome Atlas (TCGA) [2] and the International Cancer Genomics Consortium (ICGC) [3] have sequenced tens of thousands of tumor transcriptomes from diverse cancer populations. While these datasets have collectively redefined the landscape of cancer transcriptomes, additional clinically relevant features remain to be discovered. The development of SeqOthello will enable labs with limited resources to learn from sequence-level data by supporting fast and memory-efficient queries over large-scale RNA-seq datasets.
