An improved encoding of genetic variation in a Burrows-Wheeler transform.

Thomas Büchler, Enno Ohlebusch

doi:10.1093/bioinformatics/btz782

Abstract

In resequencing experiments, a high-throughput sequencer produces DNA fragments (called reads), and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers are based on the Burrows-Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. encoded single nucleotide polymorphisms (SNPs) in a BWT by the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code. In a different approach, Maciuca et al. provided a 'natural encoding' of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation.

In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, multi-nucleotide polymorphisms, insertions or deletions, duplications, transpositions, inversions and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in Huang et al. (2013, Short read alignment with populations of genomes. Bioinformatics, 29, i361-i370) and the encoding of the other kinds of genetic variation relies on the idea introduced in Maciuca et al. (2016, A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th International Workshop on Algorithms in Bioinformatics, Volume 9838 of Lecture Notes in Computer Science, pp. 222-233. Springer). In contrast to Maciuca et al., however, we use only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the 'marked chromosome'. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it with BWBBLE and gramtools.

The implementation is available at https://www.uni-ulm.de/in/theo/research/seqana/.
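
As a rough illustration of the SNP part of this idea, the following Python sketch replaces SNP positions in a toy reference with IUPAC codes, builds a BWT by sorting rotations, and runs a backward search in which a read base may match any IUPAC code that contains it. It is a minimal sketch in the spirit of IUPAC-based encoding, not the implementation described in the article: the function names (`iupac_for`, `encode_snps`, `bwt`, `count_matches`), the `$` sentinel, the naive occurrence counting, and the use of a list of suffix-array intervals are all assumptions made for illustration. The additional marker symbol for other kinds of variation is not modelled here.

```python
from collections import Counter

# Standard IUPAC codes for ambiguous base sets.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_for(bases):
    """Return the IUPAC code whose base set equals `bases` (hypothetical helper)."""
    wanted = set(bases)
    if len(wanted) == 1:
        return next(iter(wanted))
    return next(code for code, members in IUPAC.items() if set(members) == wanted)

def encode_snps(reference, snps):
    """Replace each SNP position by the IUPAC code of reference + alternative bases.
    `snps` maps a 0-based position to a string of alternative bases (assumed format)."""
    chars = list(reference)
    for pos, alts in snps.items():
        chars[pos] = iupac_for(reference[pos] + alts)
    return "".join(chars)

def bwt(text, sentinel="$"):
    """BWT via sorted rotations (quadratic; only suitable for toy inputs)."""
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_matches(bwt_text, pattern):
    """Backward search where a read base also matches IUPAC codes containing it.
    Tracks several suffix-array intervals, one per distinct matched text string."""
    counts = Counter(bwt_text)
    first, total = {}, 0
    for sym in sorted(counts):  # first[sym] = number of symbols smaller than sym
        first[sym] = total
        total += counts[sym]

    def occ(sym, i):
        # Occurrences of sym in bwt_text[:i]; naive scan for clarity.
        return bwt_text[:i].count(sym)

    intervals = [(0, len(bwt_text))]
    for base in reversed(pattern):
        compatible = [base] + [c for c, m in IUPAC.items() if base in m]
        next_intervals = []
        for lo, hi in intervals:
            for sym in compatible:
                if sym not in first:
                    continue
                new_lo = first[sym] + occ(sym, lo)
                new_hi = first[sym] + occ(sym, hi)
                if new_lo < new_hi:
                    next_intervals.append((new_lo, new_hi))
        intervals = next_intervals
    return sum(hi - lo for lo, hi in intervals)

# Toy usage: a G/A SNP at position 2 turns ACGTACGT into ACRTACGT.
encoded = encode_snps("ACGTACGT", {2: "A"})
index = bwt(encoded)
print(encoded, count_matches(index, "CAT"), count_matches(index, "CGT"))
```

In this toy example the read `CAT` is found only because the SNP is encoded in the index; against the plain reference it would not match. A practical index would replace the naive `occ` scan with rank data structures.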

Journal: Bioinformatics
Publication Date: Oct 15, 2019
Citations: 6
