Abstract

Over recent years, multiple groups have shown that a large number of structural variants, repeats, or problems with the underlying genome assembly have dramatic effects on the mapping, calling, and overall reliability of single nucleotide polymorphism calls. This project endeavored to develop an easy-to-use track for looking at structural variant and repeat regions. This track, DangerTrack, can be displayed alongside the existing Genome Reference Consortium assembly tracks to warn clinicians and biologists when variants of interest may be incorrectly called, of dubious quality, or on an insertion or copy number expansion. While mapping and variant calling can be automated, it is our opinion that when these regions are of interest to a particular clinical or research group, they warrant a careful examination, potentially involving localized reassembly. DangerTrack is available at https://github.com/DCGenomics/DangerTrack.

Highlights

  • The advent of next-generation sequencing has enabled the comparison of cells, organisms, and even populations at the genomic level

  • Multiple studies so far have suffered from mapping artifacts typically occurring in highly variable regions, including single nucleotide polymorphisms (SNPs) and structural variants (SVs), which may be repetitive regions or regions that are not correctly represented by the reference genome (Degner et al., 2009)

  • Data exploration: To assess the ability of DangerTrack to highlight suspicious regions, we computed the DangerTrack score over the human reference genome using data from the 1000 Genomes Project and Genome in a Bottle (GIAB), as well as mappability tracks from UCSC
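
As a rough illustration of the last highlight, the sketch below shows one way a per-window score could combine structural variant breakpoint density with low mappability. It is not the DangerTrack implementation: the function name `window_scores`, the 5,000 bp window, the input shapes, and the equal weighting of the two signals are all assumptions made for this example.

```python
from collections import Counter


def window_scores(breakpoints, unmappable_fraction, window=5000):
    """Return a 0-100 score per genomic window (illustrative only).

    breakpoints: iterable of (chrom, position) SV breakpoints (hypothetical input).
    unmappable_fraction: dict mapping (chrom, window_index) to a value in [0, 1],
        where 1 means the whole window is not uniquely mappable (hypothetical input).
    """
    # Count SV breakpoints falling into each fixed-size window.
    counts = Counter((chrom, pos // window) for chrom, pos in breakpoints)
    max_count = max(counts.values(), default=0)

    scores = {}
    for key in set(counts) | set(unmappable_fraction):
        # Normalize breakpoint counts by the densest window.
        sv_part = counts.get(key, 0) / max_count if max_count else 0.0
        map_part = unmappable_fraction.get(key, 0.0)
        # Equal weighting of the two signals is an assumption for illustration.
        scores[key] = 100.0 * (sv_part + map_part) / 2.0
    return scores
```

Windows with many breakpoints and poor mappability score near 100, and windows absent from both inputs receive no entry, which a caller could treat as a score of 0.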

Introduction

The advent of next-generation sequencing has enabled the comparison of cells, organisms, and even populations at the genomic level. Multiple methods have been suggested to overcome the mapping bias that arises in highly variable or poorly represented regions, including constructing a personalized reference genome (Satya et al., 2012), sequencing the parental genomes (Graze et al., 2012), building graph genomes over all known variants (Dilthey et al., 2015), or carefully reconciling particular subregions. The last of these includes discarding reads using a mapping quality filter, realigning reads locally, or computing a localized de novo assembly using the Genome Analysis Toolkit to improve the quality of SNP calls. These methods often depend on sample quality (e.g., coverage, error rate), may incur additional expense, and are often optimized only for human genome data.
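
The mapping quality filter mentioned above can be sketched in a few lines. This minimal example reads SAM-formatted text, in which MAPQ is the fifth tab-separated column, and keeps only alignments at or above a threshold. The function name `filter_by_mapq` and the default threshold of 20 are assumptions chosen for illustration, not values taken from the paper.

```python
def filter_by_mapq(sam_lines, min_mapq=20):
    """Yield SAM header lines and alignments with MAPQ >= min_mapq.

    sam_lines: iterable of SAM text lines.
    min_mapq: illustrative threshold; suitable values depend on the aligner.
    """
    for line in sam_lines:
        # Header lines start with '@' and are passed through unchanged.
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # A SAM alignment record has at least 11 mandatory fields.
        if len(fields) < 11:
            continue
        # Field 5 (index 4) is MAPQ; note that 255 means "not available"
        # in the SAM specification and passes this simple filter.
        if int(fields[4]) >= min_mapq:
            yield line
```

In practice the same filtering is usually done with an existing tool on BAM files; the point here is only that reads with low mapping quality, which cluster in repetitive or poorly represented regions, are dropped before variant calling.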
