Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

Richard Durbin

doi:10.1093/bioinformatics/btu014

Abstract

Motivation: Over the last few years, methods based on suffix arrays using the Burrows–Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which while powerful are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale too much larger datasets than those currently handled by genotype algorithms.Results: Given M sequences with N bi-allelic variable sites, an O(NM) algorithm to derive a representation of the data based on positional prefix arrays is given, which is termed the positional Burrows–Wheeler transform (PBWT). On large datasets this compresses with run-length encoding by more than a factor of a hundred smaller than using gzip on the raw data. Using this representation a method is given to find all maximal haplotype matches within the set in O(NM) time rather than O(NM2) as expected from naive pairwise comparison, and also a fast algorithm, empirically independent of M given sufficient memory for indexes, to find maximal matches between a new sequence and the set. The discussion includes some proposals about how these approaches could be used for imputation and phasing.Availability: http://github.com/richarddurbin/pbwtContact: richard.durbin@sanger.ac.uk

Highlights

Given a large collection of aligned genetic sequences, or haplotypes, it is often of interest to find long matches between sequences within the collection, or between a new test sequence and sequences from the collection
First we notice that in the matching processes we do not use directly the ak1⁄2i indexes, but rather the yi1⁄2k 1⁄4 xak1⁄2i1⁄2k values. These are a permutation of the xi1⁄2k values determined by the ak permutation indicating the sort order at k of prefixes up to position (k À 1), and are a positional analogue of the Burrows–Wheeler Transform (BWT) of X (Burrows and Wheeler, 1994; see Li and Durbin (2009) for an explanation closer to that given here if this is not familiar)
An initial implementation pbwt of the key algorithms was produced. This uses single byte run length encoding for the positional Burrows–Wheeler transform (PBWT), with the top bit encoding the value, the two bits selecting whether the length is in units of 1, 64 or 2048, and the remaining 5 bits giving the number of units

Summary

INTRODUCTION

Given a large collection of aligned genetic sequences, or haplotypes, it is often of interest to find long matches between sequences within the collection, or between a new test sequence and sequences from the collection. By keeping a running match score to find maximal matches as in BLAST, it is straightforwardly possible to reduce this to O(NM) per single test, and so OðNM2Þ across the whole collection, but this is still large for large M. Suffix-array-based methods have proved powerful in standard sequence matching, as exemplified by Bowtie (Langmead et al, 2009), BWA (Li and Durbin, 2009) and SOAP2 (Li et al, 2009). An approach based on suffix arrays is described that can find best matches within a set of sequences in O(NM) time, following preprocessing of the dataset in O(NM) time, and empirically best single haplotype matches in O(N) time. The differences between the algorithms described here and standard suffix array based sequence matching are derived from the fact that there are many sequences that are all of the same length and already aligned. On the one hand there is no need to consider offsets of the test sequence with respect to the sequences in the collection, but on the other hand the test sequence is long and we are looking for maximal matches of an arbitrary substring of the test sequence, not of the whole test sequence

APPROACH

Derivation of prefix array representation

Finding all matches within X longer than a minimum length L

Finding all set-maximal matches within X in linear time

Finding all set-maximal matches from a new sequence z to X

Compact representation of X

RESULTS

DISCUSSION

Full Text

Published version (

Free)

Open DOI Link

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Journal: Bioinformatics	Publication Date: Jan 9, 2014
Citations: 393	License type: CC BY 3.0

R Discovery Prime

R Discovery Prime

Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

Abstract

Highlights

Summary

Talk to us

Similar Papers

More From: Bioinformatics

Lead the way for us

Similar Papers

A comparison of BWT approaches to string pattern matching
Andrew Firth ... Don Adjeroh
Software: Practice and Experience | VOL. 35
Andrew Firth, et. al.Andrew Firth ... Don Adjeroh
01 Jan 2004
Software: Practice and Experience | VOL. 35

μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.
Simone Rubinacci ... Travis Gagie
Bioinformatics (Oxford, England) | VOL. 39
Simone Rubinacci, et. al.Simone Rubinacci ... Travis Gagie
02 Sep 2023
Bioinformatics (Oxford, England) | VOL. 39

Parallel and memory-efficient Burrows-Wheeler transform
Kenjiro Taura ... Shinya Hayashi
-
Kenjiro Taura, et. al.Kenjiro Taura ... Shinya Hayashi
01 Oct 2013
01 Oct 2013

Linear-time String Indexing and Analysis in Small Space
Juha Kärkkäinen ... Djamal Belazzougui
ACM Transactions on Algorithms | VOL. 16
Juha Kärkkäinen, et. al.Juha Kärkkäinen ... Djamal Belazzougui
09 Mar 2020
ACM Transactions on Algorithms | VOL. 16

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

Abstract

Highlights

Summary

Talk to us

Similar Papers

More From: Bioinformatics