ConfFuse: High-Confidence Fusion Gene Detection across Tumor Entities.

Zhiqin Huang, David T W Jones, Peter Lichter, Yonghe Wu, Marc Zapatka

doi:10.3389/fgene.2017.00137

Open Access

https://doi.org/10.3389/fgene.2017.00137

Abstract

Background: Fusion genes play an important role in the tumorigenesis of many cancers. Next-generation sequencing (NGS) technologies have been successfully applied to fusion gene detection over the last several years, and a number of NGS-based tools for identifying fusion genes have been developed during this period. Most fusion gene detection tools based on RNA-seq data report a large number of candidates (mostly false positives), making it hard to prioritize candidates for experimental validation and further analysis. Selecting reliable fusion genes for downstream analysis has thus become very important in cancer research. We therefore developed confFuse, a scoring algorithm to reliably select high-confidence fusion genes which are likely to be biologically relevant.

Results: confFuse takes multiple parameters into account to assign each fusion candidate a confidence score, where a score ≥8 indicates a high-confidence fusion gene prediction. These parameters were manually curated based on our experience and on certain structural motifs of fusion genes. Compared with alternative tools on 96 published RNA-seq samples from different tumor entities, our method significantly reduces the number of fusion candidates (301 high-confidence out of 8,083 total predicted fusion genes) while maintaining high detection accuracy (recovery rate 85.7%). Validation of 18 novel, high-confidence fusions detected in three breast tumor samples resulted in a 100% validation rate.

Conclusions: confFuse is a novel downstream filtering method that allows selection of highly reliable fusion gene candidates for further downstream analysis and experimental validation. confFuse is available at https://github.com/Zhiqin-HUANG/confFuse.
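As an illustration of the score cut-off described in the abstract, the following minimal sketch keeps only candidates with a confidence score of at least 8. Only the threshold of 8 comes from the text. The record layout, the field names (`gene5`, `gene3`, `score`), the function name, and the placeholder gene names are hypothetical and are not taken from confFuse itself.

```python
from dataclasses import dataclass

# From the abstract: a score >= 8 indicates a high-confidence prediction.
HIGH_CONFIDENCE_THRESHOLD = 8


@dataclass(frozen=True)
class FusionCandidate:
    # Hypothetical record layout; confFuse's real output columns may differ.
    gene5: str   # 5' partner gene (placeholder name)
    gene3: str   # 3' partner gene (placeholder name)
    score: int   # confidence score assigned to the candidate


def select_high_confidence(candidates, threshold=HIGH_CONFIDENCE_THRESHOLD):
    """Return candidates whose score meets the threshold, highest score first."""
    kept = [c for c in candidates if c.score >= threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Placeholder candidates, not real fusion calls.
candidates = [
    FusionCandidate("GENE_A", "GENE_B", 10),
    FusionCandidate("GENE_C", "GENE_D", 7),
    FusionCandidate("GENE_E", "GENE_F", 8),
]
high_confidence = select_high_confidence(candidates)
```

With the figures reported above, a cut-off of this kind retains 301 of 8,083 predicted fusions, or roughly 3.7% of the candidates.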

Highlights

A fusion gene is typically generated from two different genes due to genomic aberrations, or rarely at the transcript level.
These parameter weightings were manually optimized against a list of known, validated fusions in order to balance eliminating false positives against retaining true fusions.
One of the most important features supporting a true fusion event is the number of split reads and spanning reads. Since this number depends not just on mapping performance but also on fusion gene expression levels and sequencing depth, we found that a simple threshold on the number of split and spanning reads could not reliably distinguish true from false positive predictions.
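The dependence on sequencing depth noted in the last highlight can be made concrete with a small sketch. How confFuse actually handles read support is not described in this excerpt. The per-million normalization below, the function name, and the example numbers are assumptions chosen purely for illustration.

```python
def support_per_million(split_reads, spanning_reads, total_mapped_reads):
    """Supporting reads per million mapped reads (hypothetical normalization)."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return (split_reads + spanning_reads) * 1_000_000 / total_mapped_reads


# The same raw support (3 split + 2 spanning reads) at two sequencing depths.
shallow = support_per_million(3, 2, 20_000_000)    # 20 million mapped reads
deep = support_per_million(3, 2, 100_000_000)      # 100 million mapped reads
```

A fixed rule such as "at least five supporting reads" treats both libraries identically, although the relative support in the shallower library is five times higher. This is one reason a single raw-count threshold can separate true from false predictions poorly.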

Summary

Introduction

A fusion gene is typically generated from two different genes due to genomic aberrations, or rarely at the transcript level (e.g., read-through co-transcript events). It can lead to enhanced expression or altered activity of an oncogene, or deregulation of a tumor suppressor gene (Abate et al, 2014). Several technologies, such as chromosome banding analysis and fluorescence in situ hybridization, have been used to detect fusion genes. A great number of fusion gene detection tools/pipelines have been developed to interrogate NGS data, in particular paired-end RNA-seq data (Carrara et al, 2013; Kumar et al, 2016). The performance of these tools differs in terms of sensitivity and specificity, depending on the individual algorithms and filtering methods applied (Kumar et al, 2016). Each of these tools/pipelines has its own advantages and weaknesses. We developed confFuse, a scoring algorithm to reliably select high-confidence fusion genes which are likely to be biologically relevant.

Objectives

Methods

Results

Conclusion