Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.

Alan Kuhnle, Travis Gagie, Christina Boucher, Giovanni Manzini, Taher Mun, Ben Langmead

doi:10.1089/cmb.2019.0309

Abstract

Short-read aligners predominantly use the FM-index, which can easily index one or a few human genomes but does not scale well to collections of thousands of genomes. The issue is driven by the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string, which allows us to find a query's interval in the string's suffix array (SA), and (2) a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known way to keep the SA sample small without greatly slowing down access to the SA. Now that an SA sample taking about the same space as the run-length compressed BWT has been defined (SODA 2018), we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to index partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time, and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
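To illustrate why run-length compressing the BWT keeps the rank structure small on genomic databases, the following is a minimal sketch in Python. It is not the construction method of the paper: the BWT is built by naively sorting rotations, and the helper names (`bwt`, `run_length_encode`, `count_runs_demo`) and the toy parameters are assumptions made for illustration. It compares the number of BWT runs for a collection of near-identical sequences against unrelated random sequence of the same length.

```python
import random


def bwt(s):
    """BWT of s by sorting rotations; s must end with a unique, smallest sentinel."""
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in order)


def run_length_encode(text):
    """Collapse text into (character, run length) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def run_length_decode(runs):
    """Inverse of run_length_encode."""
    return "".join(ch * k for ch, k in runs)


def random_dna(length, rng):
    return "".join(rng.choice("ACGT") for _ in range(length))


def with_substitutions(genome, count, rng):
    """Copy genome and overwrite `count` random positions with random bases."""
    chars = list(genome)
    for pos in rng.sample(range(len(chars)), count):
        chars[pos] = rng.choice("ACGT")
    return "".join(chars)


def count_runs_demo(seed=0, base_len=200, copies=8, subs=2):
    """Return (text length, runs for similar copies, runs for unrelated text)."""
    rng = random.Random(seed)
    base = random_dna(base_len, rng)
    # Toy "pan-genome": several lightly mutated copies of one base sequence.
    collection = "".join(
        with_substitutions(base, subs, rng) for _ in range(copies)
    ) + "$"
    unrelated = random_dna(base_len * copies, rng) + "$"
    r_collection = len(run_length_encode(bwt(collection)))
    r_unrelated = len(run_length_encode(bwt(unrelated)))
    return len(collection), r_collection, r_unrelated


if __name__ == "__main__":
    n, r_collection, r_unrelated = count_runs_demo()
    print(f"length {n}: {r_collection} runs (similar copies) "
          f"vs {r_unrelated} runs (unrelated sequence)")
```

In this toy setting the collection of similar copies should produce far fewer runs than unrelated sequence of the same length, which is the property a run-length compressed BWT exploits. Sorting rotations is only practical for small inputs.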

Highlights

The FM-index, which is a compressed subsequence index based on the Burrows–Wheeler Transform (BWT), is the primary data structure of the majority of short-read aligners, including Bowtie (Langmead et al., 2008), BWA (Li and Durbin, 2009), and SOAP2 (Li et al., 2009).
We apply our method to index partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time, and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
We studied how the r-index scales to repetitive texts consisting of many similar genomic sequences, comparing it with Bowtie (Langmead et al., 2008), a traditional FM-index-based aligner, and CHIC (Valenzuela and Mäkinen, 2017), a hybrid index that uses LZ compression to scale to repetitive texts.

Summary

Introduction

The FM-index, which is a compressed subsequence index based on the Burrows–Wheeler Transform (BWT), is the primary data structure of the majority of short-read aligners, including Bowtie (Langmead et al., 2008), BWA (Li and Durbin, 2009), and SOAP2 (Li et al., 2009). These aligners build an FM-index-based data structure over the sequences of a given genomic database and use the index to perform queries that find approximate matches of sequences to the database.

The index relies on how the characters of the string S are ordered in two lists, F and L, where F holds the characters of S in sorted order and L is the BWT of S. If S[i] = S[j], then S[i] and S[j] have the same relative order in both lists; otherwise, their relative order in F is the same as their lexicographic order. This means that if S[i] is in position p in L then, assuming arrays are indexed from 0 and $\prec$ denotes lexicographic precedence, in F it is in position

$$j_i = |\{h : S[h] \prec S[i]\}| + |\{h : L[h] = S[i],\ h \le p\}| - 1.$$
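The position formula in the paragraph above can be checked on a small string. The following is a minimal sketch in Python, assuming the text ends with a unique sentinel `$` that is smaller than every other character; the function names (`bwt_columns`, `lf_position`, `invert_bwt`) are hypothetical and chosen for illustration. It counts characters directly rather than using a rank data structure, so it mirrors the formula and not an efficient index.

```python
def bwt_columns(s):
    """Return (F, L, SA) for s, which must end with a unique, smallest sentinel."""
    n = len(s)
    # With a unique smallest sentinel, sorting rotations orders the suffixes.
    sa = sorted(range(n), key=lambda i: s[i:] + s[:i])
    F = [s[i] for i in sa]      # first column: characters of s in sorted order
    L = [s[i - 1] for i in sa]  # last column: the BWT (s[-1] wraps around)
    return F, L, sa


def lf_position(L, p):
    """Position in F of the character found at position p in L (0-indexed)."""
    c = L[p]
    # L is a permutation of S, so counting smaller characters over L
    # gives the same value as counting them over S.
    smaller = sum(1 for x in L if x < c)
    # Occurrences of c in L at positions h <= p.
    rank = sum(1 for h in range(p + 1) if L[h] == c)
    return smaller + rank - 1


def invert_bwt(L):
    """Recover the original text from L by repeatedly applying lf_position."""
    p = 0  # row 0 starts with the sentinel, so L[0] is the last real character
    out = []
    for _ in range(len(L) - 1):
        out.append(L[p])
        p = lf_position(L, p)
    # After len(L) - 1 steps, L[p] is the sentinel itself.
    return "".join(reversed(out)) + L[p]


if __name__ == "__main__":
    text = "banana$"
    F, L, sa = bwt_columns(text)
    n = len(text)
    for p in range(n):
        j = lf_position(L, p)
        assert F[j] == L[p]               # same character in both lists
        assert sa[j] == (sa[p] - 1) % n   # one text position to the left
    print("".join(L), invert_bwt(L))
```

Each step moves from the row for text position SA[p] to the row for the position one to its left, which is why repeated application recovers the text in reverse. A real FM-index replaces the two linear scans in `lf_position` with a precomputed count array and a rank data structure over L.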



Journal: Journal of Computational Biology: A Journal of Computational Molecular Cell Biology
Publication Date: Apr 1, 2020
Citations: 47
License type: CC BY

