Wheeler graphs: A framework for BWT-based data structures

Travis Gagie, Giovanni Manzini, Jouni Sirén

doi:10.1016/j.tcs.2017.06.016

Abstract

The famous Burrows–Wheeler Transform (BWT) was originally defined for a single string, but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more. We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly. We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs.

Highlights

The Burrows–Wheeler Transformation (BWT) has a very peculiar history
After its introduction as a compression tool, interest in the BWT was rekindled when many researchers realized that, among the different techniques discovered at the turn of the century for designing compressed indexes [19,29,34], those based on the BWT are probably the simplest and most space efficient [15,43]
We show that the state graphs associated to these automata have common properties that we summarize with the concept of Wheeler graphs

Summary

Introduction

The Burrows–Wheeler Transformation (BWT) has a very peculiar history. First conceived in 1983, it was published only eleven years later in a technical report [9], presumably because it was so innovative that the first reviewers were not able to grasp its full significance. After its introduction as a compression tool, interest in the BWT was rekindled when many researchers realized that, among the different techniques discovered at the turn of the century for designing compressed indexes [19,29,34], those based on the BWT are probably the simplest and most space efficient [15,43]. After this realization, in the last ten years we have witnessed an unusual phenomenon in computer science: variants of the BWT have been proposed and applied to increasingly complex objects, from trees to graphs to alignments.

Definitions and basic results
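
The Wheeler-graph condition and the path coherence mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual formulation: a directed, edge-labeled graph is a Wheeler graph if its nodes can be ordered so that nodes with no incoming edges come first and, for any two edges (u, v) labeled a and (u', v') labeled a', a < a' implies v < v', and a = a' with u < u' implies v ≤ v'. The function names, the edge-list representation, and the example automaton below are hypothetical and chosen only for illustration; the quadratic check is a brute-force test of a given ordering, not an efficient recognition algorithm.

```python
from itertools import permutations


def is_wheeler_order(num_nodes, edges):
    """Check whether the node numbering 0..num_nodes-1 is a Wheeler order.

    edges is a list of (source, target, label) triples (assumed format).
    """
    has_incoming = [False] * num_nodes
    for _, v, _ in edges:
        has_incoming[v] = True

    # Nodes with in-degree 0 must precede all nodes with positive in-degree.
    seen_incoming = False
    for node in range(num_nodes):
        if has_incoming[node]:
            seen_incoming = True
        elif seen_incoming:
            return False

    # Compare every ordered pair of distinct edges (brute force, O(|E|^2)).
    for (u1, v1, a1), (u2, v2, a2) in permutations(edges, 2):
        if a1 < a2 and not v1 < v2:
            return False
        if a1 == a2 and u1 < u2 and not v1 <= v2:
            return False
    return True


def reachable(edges, start_nodes, string):
    """Return the set of nodes reached from start_nodes by reading string."""
    current = set(start_nodes)
    for symbol in string:
        current = {v for (u, v, a) in edges if u in current and a == symbol}
    return current


def is_consecutive(nodes):
    """True if the node numbers form a contiguous range (or the set is empty)."""
    return not nodes or max(nodes) - min(nodes) + 1 == len(nodes)


# Hypothetical non-deterministic automaton: node 0 has two outgoing 'a' edges.
EDGES = [(0, 1, "a"), (0, 2, "a"), (1, 3, "b"), (2, 3, "b"), (3, 2, "a")]

if __name__ == "__main__":
    print(is_wheeler_order(4, EDGES))
    for s in ["a", "ab", "aba"]:
        nodes = reachable(EDGES, {0}, s)
        print(s, sorted(nodes), is_consecutive(nodes))
```

In this example the numbering 0, 1, 2, 3 satisfies the conditions, and the nodes reached from node 0 by each prefix of "aba" form a contiguous range, which is the behaviour the abstract attributes to path coherence.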

Multi-string BWT and permuterm index

XBWT and trie representation

FM-index of alignment

Conclusions and future work