Abstract

Motivation: The increasing number of individual genomes sequenced per species motivates the use of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet.

Results: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome.

Availability and implementation: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • We study the problem of finding high scoring local alignments between a query sequence and a graph that are likely to represent sequence homology

  • Our algorithm finds high scoring local alignments between a given query sequence q and a pangenome represented as a compacted colored de Bruijn graph G = (V, E, λ, C) over the DNA alphabet and a color set U

  • We show the advantage in runtime, memory usage and result aggregation when searching local alignments inside a pangenome with our method compared to a conventional search and analysis using other BLAST-like software tools

Summary

Introduction

A pangenome is defined as a set of genomic sequences that may be stored and analysed collectively while being represented as a single entity. The pangenomic approach offers high memory-saving potential, as sequence parts shared by multiple genomes have to be stored only once. It enables the simultaneous comparison of a large number of individual genomes while avoiding classical reference-based analyses, which have turned out to have shortcomings in various cases [6, 10]. A method allowing exact read mapping on general graphs has been published [36]. Other solutions have been presented by Antipov et al. [5] and Kavya et al. [22].
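The memory-saving idea described above, where sequence parts shared by several genomes are stored only once, can be illustrated with a toy colored de Bruijn graph. The sketch below is not the data structure used by the authors' implementation: it is uncompacted, ignores reverse complements, and the names `build_colored_dbg`, `successors`, the genome labels and the toy sequences are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_colored_dbg(genomes, k):
    """Map each distinct k-mer to the set of genome names (colors) containing it.

    Simplifications (assumptions of this sketch): no unitig compaction and
    no canonical k-mers, i.e. reverse complements are ignored.
    """
    kmer_colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer_colors[seq[i:i + k]].add(name)
    return dict(kmer_colors)


def successors(kmer_colors, kmer):
    """Return the stored k-mers whose prefix equals the (k-1)-suffix of `kmer`."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmer_colors]


# Toy input: two short, nearly identical "genomes" (hypothetical data).
genomes = {"g1": "ACGTACGA", "g2": "ACGTACGT"}
k = 4
graph = build_colored_dbg(genomes, k)

# Number of k-mer occurrences across all genomes vs. distinct k-mers stored.
total_occurrences = sum(len(seq) - k + 1 for seq in genomes.values())
print(f"{total_occurrences} k-mer occurrences, {len(graph)} distinct k-mers stored")

# Colors show which genomes share a k-mer.
for kmer in sorted(graph):
    print(kmer, sorted(graph[kmer]))
```

In this toy case the shared k-mers carry both colors and are stored once, while the single k-mer unique to `g1` carries one color. A compacted graph would additionally merge non-branching paths of k-mers into single vertices, which this sketch does not attempt.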
