Fast read alignment with incorporation of known genomic variants

Hongzhe Guo,Bo Liu,Yilei Fu,Yadong Wang,Dengfeng Guan

doi:10.1186/s12911-019-0960-3

Abstract

Background: Many genetic variants have been reported from sequencing projects due to decreasing experimental costs. Compared to the current typical paradigm, read mapping incorporating existing variants can improve the performance of subsequent analysis. This method is supposed to map sequencing reads efficiently to a graphical index with a reference genome and known variation to increase alignment quality and variant calling accuracy. However, storing and indexing various types of variation require costly RAM space.

Methods: Aligning reads to a graph model-based index including the whole set of variants is ultimately an NP-hard problem in theory. Here, we propose a variation-aware read alignment algorithm (VARA), which generates the alignment between read and multiple genomic sequences simultaneously utilizing the schema of the Landau-Vishkin algorithm. VARA dynamically extracts regional variants to construct a pseudo tree-based structure on-the-fly for seed extension without loading the whole genome variation into memory space.

Results: We developed the novel high-throughput sequencing read aligner deBGA-VARA by integrating VARA into deBGA. The deBGA-VARA is benchmarked both on simulated reads and the NA12878 sequencing dataset. The experimental results demonstrate that read alignment incorporating genetic variation knowledge can achieve high sensitivity and accuracy.

Conclusions: Due to its efficiency, VARA provides a promising solution for further improvement of variant calling while maintaining small memory footprints. The deBGA-VARA is available at: https://github.com/hitbc/deBGA-VARA.

Highlights

An accurate and complete understanding of genetic variation is important in research on human disease [1,2,3]
Complex regions contain many biologically valuable single mutations and structural variants, e.g., the major histocompatibility complex (MHC) region on human chromosome 6, which includes the human leukocyte antigen (HLA) gene families
The variation graph toolkit developed by Garrison et al. [26] uses the GCSA2 library [23] to map reads to an arbitrary variation graph and improves accuracy over linear references at the expense of large RAM usage, e.g., a theoretical requirement of 75 GB of RAM for the GRCh37 linear reference together with the variant set produced in phase 3 of the 1000 Genomes Project (1000 GP) [27]

Summary

Introduction

An accurate and complete understanding of genetic variation is important in research on human disease [1,2,3]. A fundamental challenge of high-throughput sequencing (HTS) data analysis is accurate read alignment to one or multiple reference genomes. Mapping reads directly to a reference genome, without incorporating known variants, has been shown to give relatively high-quality results in regions with low divergence [10]. Compared to this typical paradigm, read mapping that incorporates existing variants can improve the performance of subsequent analysis. Such a method is intended to map sequencing reads efficiently to a graph-based index built from a reference genome and known variation, in order to increase alignment quality and variant calling accuracy. However, storing and indexing various types of variation requires costly RAM space.

Methods
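
The abstract states that VARA follows the schema of the Landau-Vishkin algorithm for seed extension. As an illustration only, the following is a minimal sketch of the classic Landau-Vishkin diagonal computation for a single read and a single sequence, not the VARA implementation. The function names `lv_extend` and `best_over_sequences` are hypothetical. The second helper simply tries each candidate sequence independently, which is a naive stand-in for the simultaneous, tree-based extension described in the abstract.

```python
from typing import Dict, Iterable, Optional


def lv_extend(read: str, ref: str, k: int) -> Optional[int]:
    """Fewest edits (<= k) aligning all of `read` to some prefix of `ref`, else None."""
    m, n = len(read), len(ref)

    def slide(i: int, d: int) -> int:
        # Follow exact matches along diagonal d, where the ref index is j = i + d.
        while i < m and i + d < n and read[i] == ref[i + d]:
            i += 1
        return i

    # frontier[d] = furthest read index reached on diagonal d with e edits.
    frontier: Dict[int, int] = {0: slide(0, 0)}
    if frontier[0] == m:
        return 0

    for e in range(1, k + 1):
        nxt: Dict[int, int] = {}
        for d in range(-e, e + 1):
            candidates = []
            if d in frontier:        # mismatch: advance read and ref
                candidates.append(frontier[d] + 1)
            if d + 1 in frontier:    # extra base in read: advance read only
                candidates.append(frontier[d + 1] + 1)
            if d - 1 in frontier:    # extra base in ref: advance ref only
                candidates.append(frontier[d - 1])
            # Keep only cells that lie inside the alignment matrix.
            valid = [i for i in candidates if i <= m and 0 <= i + d <= n]
            if not valid:
                continue
            nxt[d] = slide(max(valid), d)
            if nxt[d] == m:          # whole read consumed with e edits
                return e
        frontier = nxt
    return None


def best_over_sequences(read: str, sequences: Iterable[str], k: int) -> Optional[int]:
    """Naive stand-in (assumption): extend against each candidate sequence separately."""
    scores = [s for s in (lv_extend(read, seq, k) for seq in sequences) if s is not None]
    return min(scores) if scores else None
```

For example, with a reference segment `"ACGTACGT"` and an alternative segment `"ACGTTCGT"` carrying a known single-base variant, a read `"ACGTTCG"` needs one edit against the reference alone but none when the variant-bearing sequence is also considered, which is the effect that variation-aware alignment aims for.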

Results

Conclusion