A bioinformatician's guide to the forefront of suffix array construction algorithms

A M S Shrestha, M C Frith, P Horton

doi:10.1093/bib/bbt081

Abstract

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

Highlights

The problem of finding the occurrences of a pattern string in a given text is one of the most fundamental computational tasks in bioinformatics.
We focus only on linear-time algorithms, and in particular on a recent algorithm called SA-IS proposed by Nong et al. [23, 24].
It turns out that exploiting this property leads to more efficient algorithms, as we describe in this article.

Summary

Introduction

The problem of finding the occurrences of a pattern string in a given text is one of the most fundamental computational tasks in bioinformatics. One simple and effective data structure for this task is the suffix array: informally, a list of the starting positions of the suffixes of the text, sorted in alphabetical order of those suffixes. It is also possible to construct a modified suffix array that supports efficient search for all suffixes matching (a prefix of) a pattern such as ‘[ga]..c’, i.e. any occurrence of g or a followed by a c three positions later.
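To make these two ideas concrete, here is a minimal Python sketch. It builds a suffix array by naive comparison sorting (not SA-IS, and not linear-time) and then builds a modified array in which suffixes are ordered by a masked key, so that all matches to ‘[ga]..c’ fall in one contiguous block. The function names, the `SEED` table and the class symbols `r` and `.` are assumptions made for illustration; the masked ordering is only in the spirit of subset seeds and is not the DisLex transformation itself.

```python
from bisect import bisect_left, bisect_right


def suffix_array(text):
    """Naive construction: sort start positions by the suffixes they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Start positions of `pattern` in `text`, via binary search over `sa`."""
    m = len(pattern)
    # Length-m prefixes of the sorted suffixes are themselves in sorted order.
    # Materialising them costs O(n*m); this is for illustration only.
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])


# Hypothetical seed for '[ga]..c': one letter-to-class map per position.
PURINE = {"a": "r", "g": "r", "c": "c", "t": "t"}  # a and g share class 'r'
WILD = {c: "." for c in "acgt"}                    # every letter matches
EXACT = {c: c for c in "acgt"}                     # letters stay distinct
SEED = [PURINE, WILD, WILD, EXACT]


def seed_key(text, i, seed):
    """Map the first len(seed) letters of the suffix at i to their classes."""
    return "".join(cls[c] for cls, c in zip(seed, text[i:i + len(seed)]))


def seeded_suffix_array(text, seed):
    """Sort start positions by masked key, breaking ties by position."""
    return sorted(range(len(text)), key=lambda i: (seed_key(text, i, seed), i))


def find_seeded(text, ssa, seed, pattern_key):
    """Start positions whose masked key equals `pattern_key`."""
    keys = [seed_key(text, i, seed) for i in ssa]
    lo = bisect_left(keys, pattern_key)
    hi = bisect_right(keys, pattern_key)
    return sorted(ssa[lo:hi])


text = "gatcaagct"
sa = suffix_array(text)
ssa = seeded_suffix_array(text, SEED)
print(find_occurrences(text, sa, "a"))       # exact matches of 'a'
print(find_seeded(text, ssa, SEED, "r..c"))  # matches of '[ga]..c'
```

In the plain suffix array, the two matches of ‘[ga]..c’ in this text (positions 0 and 4) are not adjacent, because the suffixes are ordered by their exact letters; under the masked ordering they share the key `r..c` and so sit next to each other, which is what makes a single binary search sufficient.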

Objectives

Results

Conclusion