Abstract

Genomic variations in a reference collection are naturally represented as genome variation graphs. Such graphs encode common subsequences as vertices, and the variations are captured using additional vertices and directed edges. The resulting graphs are directed graphs, possibly with cycles. Existing algorithms for aligning sequences on such graphs make use of partial order alignment (POA) techniques that work on directed acyclic graphs (DAGs). To achieve this, acyclic extensions of the input graphs are first constructed through expensive loop unrolling steps (DAGification). Furthermore, such graph extensions can incur a considerable blow-up in size; in the worst case, the blow-up factor is proportional to the input sequence length. We provide a novel alignment algorithm, V-ALIGN, that aligns the input sequence directly on the input graph while avoiding such expensive DAGification steps. V-ALIGN is based on a novel dynamic programming (DP) formulation that allows gapped alignment directly on the input graph. It supports affine and linear gaps. We also propose refinements to V-ALIGN for better performance in practice. With the proposed refinements, the time to fill the DP table has linear dependence on the sizes of the sequence, the graph, and its feedback vertex set. We conducted experiments to compare the proposed algorithm against the existing POA-based techniques. We also performed alignment experiments on the genome variation graphs constructed from the 1000 Genomes data. For aligning short sequences, standard approaches restrict the expensive gapped alignment to small filtered subgraphs having high similarity to the input sequence. In such cases, the performance of V-ALIGN for gapped alignment on the filtered subgraph depends on the subgraph sizes.
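For orientation, the DAG-restricted style of alignment that the abstract contrasts with can be sketched as below. This is not V-ALIGN and not any published POA implementation: it is a minimal illustration that assumes one character per vertex, a linear gap penalty, unit mismatch cost, and free choice of where the aligned path starts and ends. The function name `align_to_dag` and its parameters are hypothetical. Because it relies on a topological order, it only works on acyclic inputs, which is the limitation that motivates DAGification.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def align_to_dag(query, labels, preds, mismatch=1, gap=1):
    """Minimum edit cost of aligning `query` to some path of a DAG.

    labels: dict vertex -> single-character label (assumed layout)
    preds:  dict vertex -> list of predecessor vertices
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    m = len(query)
    # Virtual start row: query prefix aligned to an empty path (all gaps).
    start = [j * gap for j in range(m + 1)]
    dp = {}
    best = start[m]
    order = TopologicalSorter({v: preds.get(v, []) for v in labels}).static_order()
    for v in order:
        # Best predecessor cost per column; a path may also start at v.
        p = list(start)
        for u in preds.get(v, []):
            p = [min(a, b) for a, b in zip(p, dp[u])]
        row = [p[0] + gap]  # v aligned against a gap, nothing consumed from query
        for j in range(1, m + 1):
            sub = 0 if labels[v] == query[j - 1] else mismatch
            row.append(min(p[j - 1] + sub,      # match or mismatch
                           p[j] + gap,          # skip vertex v
                           row[j - 1] + gap))   # skip query character
        dp[v] = row
        best = min(best, row[m])
    return best


# Toy graph with one SNP bubble: paths spell ACGA and ACTA.
labels = {0: "A", 1: "C", 2: "G", 3: "T", 4: "A"}
preds = {1: [0], 2: [1], 3: [1], 4: [2, 3]}
exact = align_to_dag("ACTA", labels, preds)
one_mismatch = align_to_dag("ACCA", labels, preds)
```

On a graph with a cycle, `static_order()` raises `graphlib.CycleError`, so a DAG-only routine like this one cannot be applied to the input graph without first unrolling it.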

Highlights

  • Most state-of-the-art high-throughput genome studies rely heavily on a high-quality reference genome [1]

  • V-ALIGN is based on a novel dynamic programming formulation that allows gapped alignment with affine, linear or constant gaps directly on the input graph

  • When the alignment is restricted to a filtered set of subgraphs for improved efficiency, V-ALIGN can be used to align to these candidate subgraphs

Summary

Introduction

Most state-of-the-art high-throughput genome studies rely heavily on a high-quality reference genome [1]. Various graph data structures, with subtle distinctions among them, have been studied in the literature for pangenome representation [3]. These include De Bruijn graphs [7], [8], ABruijn graphs [9], Enredo graphs [10], Cactus graphs [5], [11], Population Reference graphs [6], String graphs [12], and Variation graphs [2]. In variation graphs [2], the common subsequences are encoded as labeled vertices and variations are represented using additional vertices and directed edges. Such representations have shown promise for improving read mapping and variant calling performance [4]. Graph-based references have necessitated the development of graph-based computational pipelines for genome analyses [3], [2], [4].
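The variation graph encoding described above can be illustrated with a small sketch. The class `VariationGraph` and its method names are hypothetical, chosen only for this example; they do not correspond to any particular toolkit. The toy graph holds a shared prefix, a single-nucleotide variant with two alleles, and a shared suffix.

```python
from collections import defaultdict


class VariationGraph:
    """Toy variation graph: labeled vertices joined by directed edges."""

    def __init__(self):
        self.labels = {}                 # vertex id -> sequence label
        self.edges = defaultdict(list)   # vertex id -> successor ids

    def add_vertex(self, vid, label):
        self.labels[vid] = label

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def spell_paths(self, source, sink):
        """Yield the sequence spelled by each simple path from source to sink."""
        stack = [(source, [source])]
        while stack:
            v, path = stack.pop()
            if v == sink:
                yield "".join(self.labels[u] for u in path)
                continue
            for w in self.edges[v]:
                if w not in path:        # keep paths simple, so cycles are safe
                    stack.append((w, path + [w]))


g = VariationGraph()
g.add_vertex(1, "ACG")   # common subsequence
g.add_vertex(2, "T")     # one allele of the variant
g.add_vertex(3, "C")     # the other allele
g.add_vertex(4, "GGA")   # common subsequence
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    g.add_edge(src, dst)

haplotypes = sorted(g.spell_paths(1, 4))
```

Each source-to-sink path spells one sequence represented by the graph; here the two paths differ only at the variant vertex.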
