Abstract
Marcus et al. (Bioinformatics 2014) proposed to use a compressed de Bruijn graph as a description of a pan-genome, comprising the genomes of many individuals/strains of the same or closely related species. Subsequent work improved the construction of the compressed de Bruijn graph in terms of run-time and memory consumption. According to the Computational Pan-Genomics Consortium (Briefings in Bioinformatics 2016), a pan-genome data structure should support the following functionality: “All information within a data structure should be easily accessible for human eyes by visualization support on different scales.” However, a pan-genome graph can have thousands to millions of nodes, and such an amount of information is certainly not easily accessible for human eyes. Thus, the possibility to construct pan-genome subgraphs on demand would be quite valuable. In this article, we use the space-efficient representation of the compressed de Bruijn graph devised by Beller and Ohlebusch (Algorithms for Molecular Biology 2016) to construct pan-genome subgraphs on the fly. The user can specify a region in one of the genomes, and the software tool will build a subgraph that contains the path corresponding to that region and all paths that are in the neighborhood of that path. The size of the neighborhood can be controlled by the user.
Highlights
In the gene-based approach, one distinguishes between the core genome that contains genes shared by all strains within the clade, the dispens…
A k-mer and its reverse complement are not represented by the same node because we use single strands
A bi-directed graph representation is required because it is not known a priori from which strand a read originated
Summary
We have defined uncompressed and compressed de Bruijn graphs. Figure 1 shows the uncompressed de Bruijn graph of the strings S1 = ACGAATCACCAA, S2 = ACGAATCAGCAA, and S3 = GCGAATCTTTCTTTTCAA for k = 3, while Figure 2 shows the corresponding compressed graph.
We define the compressed de Bruijn subgraph relative to R with depth d to be the compressed de Bruijn graph containing all nodes u satisfying dist(u, v) ≤ d for a node v ∈ R. This definition is asymmetric:
1. If there is a path of suitable length from u to a node v in R, u will be in the subgraph.
2. If there is a path of suitable length from a node v in R to u, u will not be in the subgraph (unless case 1 applies).
We use this definition of subgraph because our construction algorithm is based on a backward search procedure. The memory requirements would be much higher because two index data structures are necessary: the wavelet tree of the BWT of S and the wavelet tree of the BWT of the reverse of S.
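As a rough illustration of the subgraph definition, the following Python sketch builds the uncompressed de Bruijn graph of S1, S2, and S3 for k = 3 and collects every node u from which some node of R can be reached in at most d steps. It is a simplification under stated assumptions: it works on the uncompressed graph and uses an ordinary breadth-first search over reversed edges, not the BWT-based backward search of the article. The function names `de_bruijn_edges` and `nodes_within` and the choice R = {CAC} are hypothetical and serve only as an example.

```python
from collections import deque


def de_bruijn_edges(strings, k):
    """Map each k-mer u to the set of k-mers v that directly follow it.

    There is an edge u -> v if some (k+1)-mer in one of the strings
    has u as its prefix and v as its suffix.
    """
    succ = {}
    for s in strings:
        for i in range(len(s) - k + 1):
            u = s[i:i + k]
            succ.setdefault(u, set())          # register the node itself
            if i + k < len(s):                 # a following k-mer exists
                succ[u].add(s[i + 1:i + k + 1])
    return succ


def nodes_within(succ, region, d):
    """Return all nodes u with dist(u, v) <= d for some v in region.

    dist(u, v) is the length of a shortest directed path from u to v,
    so the search walks the edges backwards, starting from the region.
    """
    # Invert the edge relation once: pred[v] = all u with an edge u -> v.
    pred = {}
    for u, targets in succ.items():
        for v in targets:
            pred.setdefault(v, set()).add(u)

    dist = {v: 0 for v in region}
    queue = deque(region)
    while queue:
        v = queue.popleft()
        if dist[v] == d:                       # depth limit reached
            continue
        for u in pred.get(v, ()):
            if u not in dist:                  # first visit = shortest distance
                dist[u] = dist[v] + 1
                queue.append(u)
    return set(dist)


strings = ["ACGAATCACCAA", "ACGAATCAGCAA", "GCGAATCTTTCTTTTCAA"]
succ = de_bruijn_edges(strings, k=3)
print(sorted(nodes_within(succ, ["CAC"], d=2)))
```

With R = {CAC} and d = 2, the result contains CAC, its predecessor TCA, and the two predecessors of TCA, namely ATC and TTC. The successor ACC is left out, which matches case 2 of the definition: a node that is only reachable from R does not belong to the subgraph.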