Abstract

Targeted sequencing is commonly used in clinical applications of NGS technology because it generates sufficient sequencing depth in the targeted genes of interest and thus ensures the best possible downstream analysis. Nevertheless, the accurate discovery and annotation of disease-causing mutations remains a challenging problem even in this favorable context. The difficulty is particularly salient for third-generation sequencing technologies such as PacBio. We present MICADo, a de Bruijn graph based method, implemented in Python, that makes it possible to distinguish patient-specific mutations from other alterations in targeted sequencing of a cohort of patients. MICADo analyses the NGS reads of each sample in the context of the data from the whole cohort in order to capture what is specific to the sample with respect to the cohort. MICADo is particularly suitable for sequencing data from highly heterogeneous samples, especially when the data involve high rates of non-uniform sequencing errors. It was validated on PacBio sequencing datasets from several cohorts of patients. A comparison with two widely used tools, VarScan and GATK, shows that MICADo is more accurate, especially when true mutations have frequencies close to background noise. The source code is available at http://github.com/cbib/MICADo.
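
The de Bruijn graph idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not MICADo's implementation: the function names (`kmers`, `de_bruijn_edges`, `novel_edges`), the `min_count` threshold, and the toy sequences are all hypothetical and chosen only for illustration. The sketch builds a graph from reads, where each k-mer is an edge between its (k-1)-mer prefix and suffix, and reports edges that are absent from a reference sequence.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn_edges(sequences, k):
    """Count de Bruijn graph edges: each k-mer links its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = Counter()
    for seq in sequences:
        for kmer in kmers(seq, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def novel_edges(reference, reads, k, min_count=2):
    """Return read-supported edges that the reference lacks.
    min_count is an illustrative noise filter (an assumption)."""
    ref = de_bruijn_edges([reference], k)
    sample = de_bruijn_edges(reads, k)
    return {e: c for e, c in sample.items()
            if e not in ref and c >= min_count}


# Toy data (made up): two of three reads carry a C>T substitution.
reference = "ACGTACGGTCA"
reads = ["ACGTACGGTCA", "ACGTATGGTCA", "ACGTATGGTCA"]
candidates = novel_edges(reference, reads, k=4)
```

In this toy case the substitution introduces k-mers that the reference does not contain, and these form an alternative path that leaves the reference path and rejoins it. Per the abstract, a cohort-aware method would additionally ask whether such a path is specific to one sample or shared across the cohort; that step is not shown here.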

Highlights

  • Capturing known cancer genes by next-generation sequencing, an approach known as “gene panel” or targeted sequencing, is commonly used for tumor genotyping

  • MICADo was evaluated on Pacific Biosciences (PacBio) sequencing datasets: (i) a novel sequencing of TP53 of a breast cancer cohort, (ii) a publicly available dataset of FLT3 sequencing of an acute myeloid leukemia cohort, and (iii) a synthetic dataset

  • Three pipelines based on GATK, VarScan, and MICADo were evaluated on both synthetic and real data

Summary

INTRODUCTION

Capturing known cancer genes by next-generation sequencing, an approach known as “gene panel” or targeted sequencing, is commonly used for tumor genotyping. Despite the existence of numerous computational solutions, calling somatic mutations in cancer data remains challenging due to a number of factors such as technical artifacts, sequencing errors, biases of alignment algorithms, DNA contamination (control samples contaminated with tumor DNA), and tumor heterogeneity. This issue is even more salient for third-generation sequencing data, such as PacBio: since very high read depths are required to achieve sequence accuracy close to that of Illumina and Ion Torrent (Quail et al, 2012), variant calling potentially suffers from high false positive and false negative rates. MICADo was evaluated on PacBio sequencing datasets: (i) a novel sequencing of TP53 of a breast cancer cohort, (ii) a publicly available dataset of FLT3 sequencing of an acute myeloid leukemia cohort, and (iii) a synthetic dataset.

MICADo Approach
Datasets
Construction of de Bruijn Graphs
Alternative Path Search
Alternative Path Specificity and Variant Calling
RESULTS
Evaluation on Synthetic Data
TP53 Targeted Data
FLT3 Targeted Data
DISCUSSION