Abstract

Genome graphs allow very general representations of genetic variation; depending on the model and implementation, variation at different length-scales (single nucleotide polymorphisms (SNPs), structural variants) and on different sequence backgrounds can be incorporated with different levels of transparency. We implement a model which handles this multiscale variation and develop a JSON extension of VCF (jVCF) allowing for variant calls on multiple references, both implemented in our software gramtools. We find gramtools outperforms existing methods for genotyping SNPs overlapping large deletions in M. tuberculosis and is able to genotype on multiple alternate backgrounds in P. falciparum, revealing previously hidden recombination.

Highlights

  • Variant calling, the detection of genetic variants from sequence data, is a fundamental process on which many other analyses rely

  • We show that gramtools outperforms vg and GraphTyper2 when genotyping long deletions and all the overlapping small variants from a cohort of 1017 Mycobacterium tuberculosis genomes

  • An output format like our proposed JavaScript Object Notation (JSON) extension of the variant call format (VCF) [10], which we call jVCF, becomes especially important when analysing more complex variation such as SNPs on top of alternate haplotypes, where variants need to be expressed against different references. We show such an application of multiscale variation analysis using the P. falciparum surface antigen DBLMSP2, which would not be possible using the VCF files output by vg or GraphTyper2

Summary

Introduction

The detection of genetic variants from sequence data is a fundamental process on which many other analyses rely. For PacBio/Oxford Nanopore Technology (ONT) data, genomes can be fully assembled, and the discovery and genotyping problems are in principle partially solved by aligning each assembly against a reference. Several data structures can in principle genotype alternate alleles that include both long structural variants and SNPs; implementations include Cortex, GraphTyper, vg, and BayesTyper [4,5,6,7]. All of these are based on graph representations of one form or another, ranging from genotyping a whole-genome de Bruijn graph

