Abstract
Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.
Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
Availability and implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
Sequence analysis pipelines often start by mapping the sequence reads to a reference genome of the same species
We develop the Graph BWT (GBWT), a scalable implementation of the graph extension of the positional Burrows–Wheeler transform [4, 21], to store the haplotypes as paths in the graph
The GBWT supports the following variants of the standard FM-index queries: find(X) returns the lexicographic range of reverse prefixes starting with the reverse of the pattern X, and locate(sp, ep) returns the haplotype identifiers DA[sp, ep]
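To make the two queries concrete, here is a minimal, uncompressed sketch in Python. It is not the GBWT implementation: it materializes and sorts every reverse prefix of every haplotype path, which would not scale to real data. The function names (`build_naive_index`, `find`, `locate`), the inclusive `(sp, ep)` convention, and the reading of DA as an array of haplotype identifiers in sorted order are assumptions made for illustration only.

```python
from bisect import bisect_left


def build_naive_index(paths):
    """Sort all reverse prefixes of all paths (hypothetical helper).

    Returns (prefixes, da): the sorted reverse prefixes and, in the same
    order, the identifier of the haplotype each prefix came from.
    """
    entries = []
    for hap_id, path in enumerate(paths):
        for end in range(1, len(path) + 1):
            # Reverse prefix: the first `end` nodes of the path, read backwards.
            entries.append((tuple(reversed(path[:end])), hap_id))
    entries.sort()
    prefixes = [prefix for prefix, _ in entries]
    da = [hap_id for _, hap_id in entries]
    return prefixes, da


def find(prefixes, pattern):
    """Return the inclusive range (sp, ep) of reverse prefixes that start
    with the reversed pattern, or None if the pattern does not occur."""
    rev = tuple(reversed(pattern))
    sp = bisect_left(prefixes, rev)
    ep = sp
    # Matching prefixes are contiguous in sorted order; scan to the end.
    while ep < len(prefixes) and prefixes[ep][:len(rev)] == rev:
        ep += 1
    return (sp, ep - 1) if ep > sp else None


def locate(da, sp, ep):
    """Return the haplotype identifiers stored in DA[sp..ep] (inclusive)."""
    return da[sp:ep + 1]


# Three toy haplotypes as sequences of node identifiers (made-up data).
paths = [[1, 2, 4, 5], [1, 3, 4, 5], [1, 2, 4, 6]]
prefixes, da = build_naive_index(paths)
hit = find(prefixes, [2, 4])
haplotypes_with_2_4 = locate(da, *hit)
```

In this toy collection the subpath 2, 4 is traversed by haplotypes 0 and 2, so `haplotypes_with_2_4` contains exactly those two identifiers. A pattern that no haplotype follows, such as 3, 6, makes `find` return `None`.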
Summary
Sequence analysis pipelines often start by mapping the sequence reads to a reference genome of the same species. Because individual genomes are similar, compressed text indexes can store such collections in very little space [18]. Due to this similarity, most reads map well to many haplotypes. Graphs such as de Bruijn graphs collapse sequences by local similarity instead of global alignment. They are better suited to handling structural variation than DAGs. However, the lack of a global coordinate system limits their usefulness as references. Because they collapse sequences between variants, they represent both the original haplotypes and their recombinations, that is, paths that switch between haplotypes. VG handles complex graph regions by indexing a simplified graph; the final alignment is done in the original graph. The drawback of this approach is that simplification can break paths corresponding to known haplotypes, while leaving paths representing recombinations intact.
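The point about recombinations can be illustrated with a small sketch, assuming a toy graph with two variant sites and two known haplotypes. The graph, the haplotypes, and the helper `enumerate_paths` are hypothetical and are not taken from VG. The sketch enumerates every source-to-sink path and separates the known haplotypes from paths that switch between them.

```python
def enumerate_paths(edges, source, sink):
    """Yield every source-to-sink path in a DAG given as {node: [successors]}."""
    stack = [(source,)]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            yield path
            continue
        for successor in edges.get(path[-1], []):
            stack.append(path + (successor,))


# Toy graph with two variant sites: nodes 2/3 and nodes 5/6 are alternatives.
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7]}

# The two haplotypes assumed to be observed in this example.
haplotypes = {(1, 2, 4, 5, 7), (1, 3, 4, 6, 7)}

all_paths = set(enumerate_paths(edges, 1, 7))
# Paths the graph allows but no known haplotype follows.
recombinations = all_paths - haplotypes
```

With two sites the graph already contains four paths, only two of which are observed haplotypes. The number of paths grows exponentially with the number of variant sites, while the number of haplotypes stays fixed.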