Abstract

Public archives of sequencing data continue to grow at a rapid pace. (Re)analysis of large datasets can be both expensive and resource-intensive, and users are often only interested in the sample prevalence of a small subset of sequences. We have implemented a Sequence Bloom Tree data structure for the TCGA RNA-seq datasets, allowing researchers to rapidly test samples for the presence of sequences of interest. We demonstrate the ability to rapidly identify samples containing viral transcript sequences, estimate the sample prevalence of gene fusions and novel splice variants, and infer the presence of HeLa-cell contamination in a subset of TCGA data. We have implemented post-querying controls to mitigate false positives arising from the presence of genes with highly similar sequences. The tools are described using Common Workflow Language, allowing researchers to reproducibly generate and query the dataset.

Citation Format: Erik D. Lehnert, Eric Freeman, Julia Salzman. Precise and rapid detection of gene fusions and microbial pathogens in next-generation sequencing data with sequence bloom trees [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr LB-005. doi:10.1158/1538-7445.AM2017-LB-005
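
To illustrate the kind of lookup a Sequence Bloom Tree supports, here is a minimal sketch in Python. It is not the authors' implementation: the names (`BloomFilter`, `Node`, `build_tree`, `query`), the filter size, the number of hash functions, the k-mer length, and the threshold `theta` are all assumptions chosen for illustration. Each leaf holds a Bloom filter of one sample's k-mers, each internal node holds the union of its children's filters, and a query descends only into subtrees whose filter contains at least a fraction `theta` of the query k-mers.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit vector."""

    def __init__(self, n_bits: int = 4096, n_hashes: int = 3, bits: int = 0):
        self.n_bits = n_bits      # assumed size, far smaller than real use
        self.n_hashes = n_hashes  # assumed number of hash functions
        self.bits = bits

    def _positions(self, item: str) -> Iterator[int]:
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # Bitwise OR of two equally sized filters represents the union.
        return BloomFilter(self.n_bits, self.n_hashes, self.bits | other.bits)


@dataclass
class Node:
    bloom: BloomFilter
    sample: Optional[str] = None            # set only on leaves
    children: List["Node"] = field(default_factory=list)


def build_tree(samples: Dict[str, List[str]], k: int) -> Node:
    """Build a binary tree bottom-up by pairing nodes level by level."""
    if not samples:
        raise ValueError("need at least one sample")
    level = []
    for name, reads in samples.items():
        bf = BloomFilter()
        for read in reads:
            for km in kmers(read, k):
                bf.add(km)
        level.append(Node(bf, sample=name))
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                merged.append(pair[0])  # odd node moves up unchanged
            else:
                merged.append(Node(pair[0].bloom.union(pair[1].bloom),
                                   children=pair))
        level = merged
    return level[0]


def query(node: Node, query_kmers: List[str], theta: float = 0.8) -> List[str]:
    """Return samples whose filter holds >= theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []                       # prune this whole subtree
    if node.sample is not None:
        return [node.sample]
    found: List[str] = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


# Hypothetical toy data; real inputs would be RNA-seq reads per sample.
toy_samples = {
    "sample_A": ["ACGTACGTTGCAAGCTTAGC"],
    "sample_B": ["TTGACCGGATTACAGATTCCA"],
    "sample_C": ["GGGCCCAAATTTGGGCCCAAA"],
}
toy_tree = build_tree(toy_samples, k=5)
print(query(toy_tree, kmers("GATTACAGATT", 5)))
```

Because Bloom filters have no false negatives, a sample that truly contains the query k-mers is always reported; a sample can also be reported spuriously, which is the kind of false positive that post-query controls are meant to catch.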

