Abstract
Genomic variations in a reference collection are naturally represented as genome variation graphs. Such graphs encode common subsequences as vertices, and the variations are captured using additional vertices and directed edges. The resulting graphs are directed and may contain cycles. Existing algorithms for aligning sequences on such graphs make use of partial order alignment (POA) techniques that work on directed acyclic graphs (DAGs). To achieve this, acyclic extensions of the input graphs are first constructed through expensive loop unrolling steps (DAGification). Furthermore, such graph extensions can blow up considerably in size; in the worst case the blow-up factor is proportional to the input sequence length. We provide a novel alignment algorithm, V-ALIGN, that aligns the input sequence directly on the input graph while avoiding such expensive DAGification steps. V-ALIGN is based on a novel dynamic programming (DP) formulation that allows gapped alignment directly on the input graph. It supports affine and linear gaps. We also propose refinements to V-ALIGN for better performance in practice. With the proposed refinements, the time to fill the DP table has linear dependence on the sizes of the sequence, the graph, and its feedback vertex set. We conducted experiments to compare the proposed algorithm against the existing POA-based techniques. We also performed alignment experiments on the genome variation graphs constructed from the 1000 Genomes data. For aligning short sequences, standard approaches restrict the expensive gapped alignment to small filtered subgraphs having high similarity to the input sequence. In such cases, the performance of V-ALIGN for gapped alignment on the filtered subgraph depends on the subgraph sizes.
Highlights
Most state-of-the-art high-throughput genome studies rely heavily on a high-quality reference genome [1]
V-ALIGN is based on a novel dynamic programming formulation that allows gapped alignment with affine, linear or constant gaps directly on the input graph
When the alignment is restricted to a filtered set of subgraphs, which is done for improved efficiency, V-ALIGN can be used for aligning to these candidate subgraphs
Summary
Most state-of-the-art high-throughput genome studies rely heavily on a high-quality reference genome [1]. Various graph data structures, with subtle distinctions between them, have been studied in the literature for pangenome representation [3]. These include De Bruijn graphs [7], [8], ABruijn graphs [9], Enredo graphs [10], Cactus graphs [5], [11], Population Reference graphs [6], String graphs [12], and Variation graphs [2]. In variation graphs [2], the common subsequences are encoded as labeled vertices, and variations are represented using additional vertices and directed edges. Such representations have shown promise in improving read mapping and variant calling performance [4]. Graph-based references have necessitated the development of graph-based computational pipelines for genome analyses [3], [2], [4].
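As an illustration of the variation graph representation described above, the following minimal Python sketch stores a sequence label on each vertex and keeps directed edges in an adjacency list. The class name `VariationGraph`, its methods, and the toy sequences are hypothetical and chosen only for this example; they are not taken from V-ALIGN or from any existing graph genome toolkit.

```python
from collections import defaultdict


class VariationGraph:
    """Toy variation graph: labeled vertices plus directed edges (illustrative only)."""

    def __init__(self):
        self.label = {}                 # vertex id -> sequence label
        self.succ = defaultdict(list)   # vertex id -> list of successor ids

    def add_vertex(self, v, seq):
        self.label[v] = seq

    def add_edge(self, u, v):
        self.succ[u].append(v)

    def spell(self, path):
        """Concatenate the labels along a path, checking that every step is an edge."""
        for u, v in zip(path, path[1:]):
            if v not in self.succ.get(u, []):
                raise ValueError(f"no edge {u} -> {v}")
        return "".join(self.label[v] for v in path)


# A single-base variant (T versus C) flanked by two shared subsequences.
g = VariationGraph()
g.add_vertex(1, "ACG")   # common prefix
g.add_vertex(2, "T")     # one allele
g.add_vertex(3, "C")     # alternative allele
g.add_vertex(4, "GGA")   # common suffix
for u, v in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    g.add_edge(u, v)

reference = g.spell([1, 2, 4])   # "ACGTGGA"
variant = g.spell([1, 3, 4])     # "ACGCGGA"
```

Each path through the graph spells one sequence of the collection, so the two alleles share the vertices for the common prefix and suffix. This toy graph is acyclic; the graphs considered in the abstract may also contain cycles, which the same adjacency-list layout can hold without change.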
More From: Journal of computational biology : a journal of computational molecular cell biology