Abstract

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.

Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Availability and implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Sequence analysis pipelines often start by mapping the sequence reads to a reference genome of the same species

  • We develop the Graph BWT (GBWT), a scalable implementation of the graph extension of the positional Burrows–Wheeler transform [4, 21], to store the haplotypes as paths in the graph

  • The GBWT supports the following variants of the standard FM-index queries: find(X) returns the lexicographic range of reverse prefixes starting with the reverse pattern, and locate(sp, ep) returns the haplotype identifiers DA[sp, ep].
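The semantics of the two queries in the last highlight can be illustrated with a deliberately naive sketch. It is not the GBWT data structure or its API: it simply sorts every reverse prefix of every path explicitly and keeps the path identifier of each one as a document array. The names `build_index`, `find`, `locate` and the toy paths are assumptions made for this illustration only.

```python
from typing import List, Optional, Sequence, Tuple

Path = Sequence[int]  # a haplotype as a sequence of node identifiers


def build_index(paths: Sequence[Path]) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Sort all reverse prefixes of all paths and record which path each came from.

    Returns (prefixes, da), where da[i] is the path identifier of prefixes[i].
    This takes quadratic space; it only illustrates what the queries mean.
    """
    entries = []
    for path_id, path in enumerate(paths):
        for end in range(1, len(path) + 1):
            # The prefix path[0:end], read backwards from its last node.
            entries.append((tuple(reversed(path[:end])), path_id))
    entries.sort()
    return [prefix for prefix, _ in entries], [path_id for _, path_id in entries]


def find(prefixes: List[Tuple[int, ...]], pattern: Path) -> Optional[Tuple[int, int]]:
    """Return the inclusive range (sp, ep) of reverse prefixes that start
    with the reversed pattern, or None if the pattern does not occur."""
    rev = tuple(reversed(pattern))
    hits = [i for i, prefix in enumerate(prefixes) if prefix[:len(rev)] == rev]
    if not hits:
        return None
    # Entries sharing a common prefix are contiguous in sorted order.
    return hits[0], hits[-1]


def locate(da: List[int], sp: int, ep: int) -> List[int]:
    """Return the distinct path identifiers in DA[sp, ep] (inclusive)."""
    return sorted(set(da[sp:ep + 1]))


# Three toy haplotypes over node identifiers 1..6 (hypothetical data).
paths = [[1, 2, 4, 5], [1, 3, 4, 5], [1, 2, 4, 6]]
prefixes, da = build_index(paths)
```

With this toy data, `locate(da, *find(prefixes, [2, 4]))` reports the two haplotypes that traverse node 2 and then node 4. A real index answers the same questions without materializing the sorted prefixes.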


Summary

Introduction

Sequence analysis pipelines often start by mapping the sequence reads to a reference genome of the same species. Because individual genomes are similar, compressed text indexes can store collections of genomes in very little space [18]. Due to this similarity, most reads map well to many haplotypes. Graphs such as de Bruijn graphs collapse sequences by local similarity instead of global alignment. They are better suited to handling structural variation than DAGs. However, the lack of a global coordinate system limits their usefulness as references. Because graphs collapse sequences between variants, they represent both the original haplotypes and their recombinations, that is, paths that switch between haplotypes. VG handles complex graph regions by indexing a simplified graph; the final alignment is done in the original graph. The drawback of this approach is that simplification can break paths corresponding to known haplotypes, while leaving paths representing recombinations intact.
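The point about recombinations can be made concrete with a small sketch. It builds a graph from two toy haplotypes, enumerates every source-to-sink path, and lists the paths that are not among the input haplotypes. The helper names and the node identifiers are hypothetical, the graph is assumed to be acyclic, and this is not how VG represents or traverses graphs.

```python
from typing import Dict, List, Sequence, Set, Tuple


def build_edges(haplotypes: Sequence[Sequence[int]]) -> Dict[int, Set[int]]:
    """Collapse haplotypes into a graph: one edge per pair of consecutive nodes."""
    edges: Dict[int, Set[int]] = {}
    for hap in haplotypes:
        for u, v in zip(hap, hap[1:]):
            edges.setdefault(u, set()).add(v)
    return edges


def all_paths(edges: Dict[int, Set[int]], source: int, sink: int) -> List[Tuple[int, ...]]:
    """Enumerate all source-to-sink paths with a depth-first search.

    Assumes the graph is acyclic; the number of paths can grow
    exponentially with the number of variant sites.
    """
    found = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            found.append(tuple(path))
            continue
        for nxt in sorted(edges.get(path[-1], ())):
            stack.append(path + [nxt])
    return sorted(found)


# Two toy haplotypes that differ at two variant sites (hypothetical data).
haplotypes = [(1, 2, 4, 5, 7), (1, 3, 4, 6, 7)]
edges = build_edges(haplotypes)
graph_paths = all_paths(edges, source=1, sink=7)

# Paths present in the graph but in neither input haplotype.
recombinations = [p for p in graph_paths if p not in set(haplotypes)]
```

Here the graph contains four paths although only two haplotypes were used to build it; the other two switch between haplotypes at the shared node 4. Storing the haplotypes as paths alongside the graph is what allows the true paths to be told apart from such recombinations.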

Strings and graphs
FM-index
Positional BWT
Graph extension
Records
GBWT encodings
GBWT construction
Basic construction
Construction in VG
Haplotype-aware graph simplification
Experiments
GBWT benchmarks
Haplotype-aware graphs
Discussion
Findings
