Abstract

BackgroundThe accurate screening of tumor genomic landscapes for somatic mutations using high-throughput sequencing involves a crucial step in precise clinical diagnosis and targeted therapy. However, the complex inherent features of cancer tissue, especially, tumor genetic intra-heterogeneity coupled with the problem of sequencing and alignment artifacts, makes somatic variant calling a challenging task. Current variant filtering strategies, such as rule-based filtering and consensus voting of different algorithms, have previously helped to increase specificity, although comes at the cost of sensitivity.MethodsIn light of this, we have developed the NeoMutate framework which incorporates 7 supervised machine learning (ML) algorithms to exploit the strengths of multiple variant callers, using a non-redundant set of biological and sequence features. We benchmarked NeoMutate by simulating more than 10,000 bona fide cancer-related mutations into three well-characterized Genome in a Bottle (GIAB) reference samples.ResultsA robust and exhaustive evaluation of NeoMutate’s performance based on 5-fold cross validation experiments, in addition to 3 independent tests, demonstrated a substantially improved variant detection accuracy compared to any of its individual composite variant callers and consensus calling of multiple tools.ConclusionsWe show here that integrating multiple tools in an ensemble ML layer optimizes somatic variant detection rates, leading to a potentially improved variant selection framework for the diagnosis and treatment of cancer.

Highlights

  • One of the hallmarks of cancer is the accumulation of somatic genomic alterations, acquired during a cell’s life cycle and development [1]

  • Following the concept of aggregating the results from multiple variant calling algorithms in an ensemble machine learning (ML) layer, to yield an overall improved performance compared to the individual component tools or consensus of tools, we have developed NeoMutate

  • Generation of synthetic tumor variant datasets and next generation sequencing (NGS) data processing Following the methodology used by International Cancer Genome Consortium (ICGC)-TCGA DREAM Somatic Mutation Calling Challenge for the synthetic dataset generation, BamSurgeon was used to spike mutations into three different well-known datasets included in the 1000 Genome Project (Table 1)

Read more

Summary

Introduction

One of the hallmarks of cancer is the accumulation of somatic genomic alterations, acquired during a cell’s life cycle and development [1]. The continued advancements in variant detection in the last decade has resulted in a wide collection of bioinformatics tools based on different underlying statistical models. Their results vary widely as demonstrated in several benchmarking studies [3,4,5,6,7,8,9,10,11] and none have emerged as a gold standard and been. The complex inherent features of cancer tissue, especially, tumor genetic intra-heterogeneity coupled with the problem of sequencing and alignment artifacts, makes somatic variant calling a challenging task. Current variant filtering strategies, such as rule-based filtering and consensus voting of different algorithms, have previously helped to increase specificity, comes at the cost of sensitivity

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call