Haplotype-aware graph indexes.

Jouni Sirén, Richard Durbin, Adam M Novak, Erik Garrison, Benedict Paten, Alfonso Valencia

doi:10.1093/bioinformatics/btz575

Open Access

Abstract

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.

Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Availability and implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Sequence analysis pipelines often start by mapping the sequence reads to a reference genome of the same species
We develop the Graph BWT (GBWT), a scalable implementation of the graph extension of the positional Burrows–Wheeler transform [4, 21], to store the haplotypes as paths in the graph
The GBWT supports the following variants of the standard FM-index queries: find(X) returns the lexicographic range of reverse prefixes starting with the reverse pattern, and locate(sp, ep) returns the haplotype identifiers DA[sp, ep]
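The two queries can be illustrated with a deliberately naive sketch. It is not the GBWT implementation or its API: it materializes and sorts every reverse prefix explicitly, which a compressed index avoids. It assumes haplotypes are given as lists of integer node identifiers, and the names `build_naive_index`, `find` and `locate` are hypothetical, chosen only to mirror the query names above.

```python
from bisect import bisect_left


def build_naive_index(paths):
    """Sort every reverse prefix of every path.

    Returns (keys, da): keys[i] is the i-th reverse prefix in lexicographic
    order and da[i] is the identifier of the haplotype it came from.
    """
    entries = []
    for hap_id, path in enumerate(paths):
        for end in range(1, len(path) + 1):
            reverse_prefix = tuple(reversed(path[:end]))
            entries.append((reverse_prefix, hap_id))
    entries.sort()
    keys = [reverse_prefix for reverse_prefix, _ in entries]
    da = [hap_id for _, hap_id in entries]
    return keys, da


def find(keys, pattern):
    """Return the inclusive range (sp, ep) of reverse prefixes that start
    with the reversed pattern, or None if the pattern does not occur."""
    rev = tuple(reversed(pattern))
    sp = bisect_left(keys, rev)
    ep = sp
    # Linear scan for clarity; a real index would not do this.
    while ep < len(keys) and keys[ep][:len(rev)] == rev:
        ep += 1
    if ep == sp:
        return None
    return sp, ep - 1


def locate(da, sp, ep):
    """Return the distinct haplotype identifiers in DA[sp, ep]."""
    return sorted(set(da[sp:ep + 1]))


# Three toy haplotypes over a small graph (node identifiers are made up).
haplotypes = [[1, 2, 4, 5], [1, 3, 4, 5], [1, 2, 4, 6]]
keys, da = build_naive_index(haplotypes)

shared = find(keys, [2, 4])            # subpath present in haplotypes 0 and 2
print(locate(da, *shared))
print(find(keys, [3, 4, 6]))           # a recombination: no haplotype contains it
```

In this toy example the subpath 3, 4, 6 is a valid walk in the graph induced by the haplotypes, yet `find` reports no match for it, which is the kind of distinction a haplotype index makes possible.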

Summary

Introduction

Sequence analysis pipelines often start by mapping the sequence reads to a reference genome of the same species. Because individual genomes are similar, compressed text indexes can store collections of haplotypes in very little space [18]. Due to this similarity, most reads map well to many haplotypes. Graphs such as de Bruijn graphs collapse sequences by local similarity instead of global alignment. They are better suited to handling structural variation than DAGs, but the lack of a global coordinate system limits their usefulness as references. Because graphs collapse sequences between variants, they represent both the original haplotypes and their recombinations, that is, paths that switch between haplotypes. VG handles complex graph regions by indexing a simplified graph, while the final alignment is done in the original graph. The drawback of this approach is that simplification can break paths corresponding to known haplotypes while leaving paths representing recombinations intact.
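The point about recombinations can be made concrete with a small sketch. It assumes a toy acyclic graph built from two made-up haplotypes that differ at two variant sites; the node identifiers and the helper names `graph_from_paths` and `all_paths` are illustrative only and do not come from VG.

```python
def graph_from_paths(paths):
    """Collapse haplotype paths into a graph: node -> set of successor nodes."""
    edges = {}
    for path in paths:
        for u, v in zip(path, path[1:]):
            edges.setdefault(u, set()).add(v)
    return edges


def all_paths(edges, source, sink):
    """Enumerate every source-to-sink walk. Assumes the graph is acyclic."""
    stack = [(source,)]
    found = []
    while stack:
        walk = stack.pop()
        if walk[-1] == sink:
            found.append(walk)
            continue
        for successor in sorted(edges.get(walk[-1], ())):
            stack.append(walk + (successor,))
    return sorted(found)


# Two haplotypes that differ at two sites: 2 vs 3, and 5 vs 6.
haplotypes = [[1, 2, 4, 5, 7], [1, 3, 4, 6, 7]]
edges = graph_from_paths(haplotypes)

known = {tuple(h) for h in haplotypes}
recombinations = [p for p in all_paths(edges, 1, 7) if p not in known]
print(recombinations)
```

With only two variant sites the graph already contains as many recombinant paths as true haplotypes. Each additional independent site doubles the number of paths in the graph while the number of observed haplotypes stays fixed.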

Strings and graphs

FM-index

Positional BWT

Graph extension

Records

GBWT encodings

GBWT construction

Basic construction

Construction in VG

Haplotype-aware graph simplification

Experiments

GBWT benchmarks

Haplotype-aware graphs

Discussion

Journal: Bioinformatics	Publication Date: Jul 26, 2019
Citations: 74	License type: CC BY 4.0


