Abstract

Introduction: Our goal is to improve gene fusion detection by RNA sequencing by combining multiple fusion callers through machine learning.

Background: Gene fusion events are important drivers of malignancy. RNA sequencing (RNAseq) methods for fusion detection have the advantage that multiple markers can be targeted at one time. Unlike DNA methods, in which fusion breakpoints are challenging to capture, RNA methods readily identify fusions through chimeric transcripts. While many fusion calling algorithms exist for RNAseq data, the sensitive fusion callers needed for samples of low tumor content often have high false positive rates, a result of aligning chimeric transcripts. Further, no single feature in NGS data can currently be used to filter out false positive fusion calls. To achieve higher accuracy than individual fusion callers can provide, we weighted and combined the results of multiple fusion callers by systematic and objective means: an ensemble learning approach based on random forest models. Our method selects from data generated by three independent fusion callers, supplemented by metrics obtained from in-house methods. It produces a metric that can be interpreted directly as the probability that a candidate fusion call is a true fusion call.

Methods: Random forest models were generated with the randomForest package in R and tuned with the R caret package. The training data consisted of a balanced set of 394 fusion calls from clinical samples of solid tumors. For training, fusion calls with at least 10 supporting reads were deemed true or false based on manual review in IGV and on orthogonal methods, including PCR with Sanger sequencing and the commercial Archer™ fusion CTL and Sarcoma panels. We present the results of training on data from three well-known fusion callers, Arriba, STAR-Fusion, and FusionCatcher, together with additional data from an in-house junction counting method and the fusion's membership in a list of known fusions (a “white list”). Models were validated by 10-fold cross-validation.

Results: In performance evaluations, calls were scored as false positives or false negatives on the basis of orthogonal determinations. On that basis, our current best model has an accuracy of 94.9% (sensitivity 93.4%, specificity 96.7%). Currently, High Confidence fusion calls (calls with a probability score greater than 70%) are the most common positive calls. These have been confirmed with 100% success.

Conclusion: We have successfully integrated multiple fusion callers by means of random forest models. Our current model is validated for use on our solid tumor fusion calling pipeline.

Citation Format: Kenneth B. Thomas, Yanglong Mou, Christophe Magnan, Tibor Gyuris, Eve Shinbrot, Fernando Lopez Diaz, Steven Lau-Rivera, Segun Jung, Vincent Funari, Lawrence M. Weiss. Gene fusion calling from RNA panel sequencing data: An ensemble learning approach [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 240.
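
Illustrative note: the performance figures and the High Confidence category described in the abstract can be made concrete with a small sketch. The Python below is not the authors' pipeline (which used random forest models built in R) and reproduces none of their data. It only shows how accuracy, sensitivity, and specificity follow from calls scored against orthogonal truth, and how a probability threshold of 70% would single out High Confidence calls. The `ScoredCall` type, the gene-pair labels, the example probabilities, and the 0.5 cutoff for calling a fusion positive are all assumptions made for illustration; only the 70% threshold comes from the abstract.

```python
from dataclasses import dataclass

# From the abstract: High Confidence calls have probability score > 70%.
HIGH_CONFIDENCE_THRESHOLD = 0.70

# ASSUMPTION: the abstract does not state the cutoff used to call a fusion
# positive; 0.5 is a conventional choice used here for illustration only.
POSITIVE_CUTOFF = 0.5


@dataclass(frozen=True)
class ScoredCall:
    """Hypothetical record for one candidate fusion call."""
    fusion: str         # placeholder label such as "GENE_A--GENE_B"
    probability: float  # model output: probability the call is a true fusion
    is_true: bool       # truth label from orthogonal determination


def confusion_counts(calls, cutoff=POSITIVE_CUTOFF):
    """Return (tp, fp, tn, fn) for calls scored against their truth labels."""
    tp = fp = tn = fn = 0
    for call in calls:
        predicted_positive = call.probability > cutoff
        if predicted_positive and call.is_true:
            tp += 1
        elif predicted_positive and not call.is_true:
            fp += 1
        elif not predicted_positive and not call.is_true:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Divide safely; an undefined ratio is reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def performance(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from confusion counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
    }


def high_confidence(calls, threshold=HIGH_CONFIDENCE_THRESHOLD):
    """Select calls whose probability score exceeds the threshold."""
    return [call for call in calls if call.probability > threshold]


if __name__ == "__main__":
    # Entirely made-up example values, not data from the study.
    example_calls = [
        ScoredCall("GENE_A--GENE_B", 0.92, True),
        ScoredCall("GENE_C--GENE_D", 0.61, True),
        ScoredCall("GENE_E--GENE_F", 0.35, True),
        ScoredCall("GENE_G--GENE_H", 0.55, False),
        ScoredCall("GENE_I--GENE_J", 0.08, False),
        ScoredCall("GENE_K--GENE_L", 0.12, False),
    ]
    print(performance(*confusion_counts(example_calls)))
    print([call.fusion for call in high_confidence(example_calls)])
```

In a 10-fold cross-validation such as the one the abstract describes, each call would be scored by a model trained on the other nine folds before the counts are pooled; the sketch leaves that step out and starts from already-scored calls.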
