Direct Superbubble Detection

Fabian Gärtner, Peter F Stadler

doi:10.3390/a12040081

Abstract

Superbubbles are a class of induced subgraphs in digraphs that play an essential role in assembly algorithms for high-throughput sequencing data. They are connected with the remainder of the host digraph by a single entrance and a single exit vertex. Linear-time algorithms for the enumeration of superbubbles have recently become available. Current approaches require the decomposition of the input digraph into strongly-connected components, which are then analyzed separately. In principle, a single depth-first search (DFS) could be used, provided one can guarantee that the root of the DFS-tree is neither located in the interior of a superbubble nor is its exit point. Here, we describe a linear-time algorithm to determine suitable roots for a DFS-forest that is guaranteed to identify the superbubbles in a digraph correctly. In addition to the advantages of a more straightforward implementation, we observe a nearly three-fold gain in performance on real-world datasets. We present a reference implementation of the new algorithm that accepts many commonly-used input formats for digraphs. It is available as open source from GitHub.
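
To make concrete the preprocessing step that current approaches rely on, the decomposition of a digraph into strongly-connected components can be sketched as follows. This is only an illustration using the textbook two-pass scheme (Kosaraju's algorithm), not the code of any of the cited implementations; the function name `strongly_connected_components` and the input convention (a list of vertices plus a list of directed edge pairs) are assumptions made for this example.

```python
from collections import defaultdict


def strongly_connected_components(vertices, edges):
    """Return the strongly-connected components of a digraph.

    Hypothetical helper for illustration: `vertices` is an iterable of
    hashable vertex labels, `edges` an iterable of (tail, head) pairs.
    Runs in O(|V| + |E|) time.
    """
    succ = defaultdict(list)  # successors in G
    pred = defaultdict(list)  # successors in the reversed digraph
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # Pass 1: iterative DFS on G, recording vertices in order of completion.
    visited = set()
    finish_order = []
    for root in vertices:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in visited:
                    visited.add(child)
                    stack.append((child, iter(succ[child])))
                    break
            else:
                # All children explored: the vertex is finished.
                finish_order.append(node)
                stack.pop()

    # Pass 2: search the reversed digraph in decreasing finishing order;
    # each search collects exactly one strongly-connected component.
    assigned = set()
    components = []
    for root in reversed(finish_order):
        if root in assigned:
            continue
        assigned.add(root)
        component = []
        todo = [root]
        while todo:
            node = todo.pop()
            component.append(node)
            for p in pred[node]:
                if p not in assigned:
                    assigned.add(p)
                    todo.append(p)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    # A 3-cycle 1 -> 2 -> 3 -> 1 with a tail 3 -> 4 -> 5.
    example_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
    print(strongly_connected_components([1, 2, 3, 4, 5], example_edges))
```

Each component returned here would then be analyzed separately by the existing approaches; the single-DFS strategy described in the abstract is meant to avoid this extra pass.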

Highlights

Bubble structures in a digraph have become the focus of an increasing body of research because of their role in genome assembly and related topics; see, e.g., [1] and the references therein
We show how to retrieve all weak superbubbles of a digraph G that are located within the induced subgraph G[V[r]] of G
Since cycles and superbubbles are necessarily completely contained within the depth-first search (DFS)-trees, this does not affect the correctness of the algorithm

Summary

Introduction

Bubble structures in a digraph have become the focus of an increasing body of research because of their role in genome assembly and related topics; see, e.g., [1] and the references therein. Superbubbles were proposed as an important class of subgraphs in the de Bruijn and overlap digraphs arising in the context of the assembly of high-throughput sequencing data [3,4]. The algorithm for identifying all superbubbles in a digraph G with vertex set V and edge set E had a running time of O(|V|(|V| + |E|)) [2]. A linear-time algorithm for an acyclic subgraph together with the construction of auxiliary digraphs along the lines of [5] provided a solution in. An alternative linear-time algorithm [7] achieves a substantial speedup and does not require sophisticated data structures. All these approaches rely on the decomposition

Objectives

Results

Conclusion