SeqAn: An efficient, generic C++ library for sequence analysis

Andreas Döring, Tobias Rausch, David Weese, Knut Reinert

doi:10.1186/1471-2105-9-11

Abstract

Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use.

Results: To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use.

Conclusion: We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This not only leverages the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.

Summary

Introduction

The use of novel algorithmic techniques is pivotal to many important problems in life science. With entire genomes at hand, large-scale analysis algorithms that require considerable computing resources are becoming increasingly important (e.g., Lagan [7], MUMmer [8], MGA [9], Mauve [10]). Although these tools use slightly different algorithms, most of them require some basic algorithmic components, like suffix arrays, string [...]. Suboptimal data types and ad-hoc algorithms are frequently employed in practice, or one has to resort to stringing standalone tools together. Both approaches may be suitable at times, but it would clearly be much more desirable to use an integrated library of state-of-the-art components that can be combined in various ways, either to develop new applications or to compare alternative implementations. In this article we propose SeqAn, a novel C++ library of efficient data types and algorithms for sequence analysis in computational biology.
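
To illustrate the kind of basic component referred to above, the following is a minimal sketch of a suffix array with exact pattern lookup, written against the C++ standard library only. It does not use SeqAn's API and is not the library's implementation; the function names `buildSuffixArray` and `findAll` are hypothetical, and the construction is a naive comparison sort chosen for brevity rather than efficiency.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string_view>
#include <vector>

// Build a suffix array by sorting all suffix start positions
// lexicographically. Naive construction (comparison sort on suffixes),
// intended for illustration only, not for genome-scale input.
std::vector<std::size_t> buildSuffixArray(std::string_view text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [text](std::size_t a, std::size_t b) {
        return text.substr(a) < text.substr(b);
    });
    return sa;
}

// Return the start positions (in increasing order) of all exact
// occurrences of `pattern` in `text`, using binary search on the
// suffix array `sa` built from the same text.
std::vector<std::size_t> findAll(std::string_view text,
                                 const std::vector<std::size_t>& sa,
                                 std::string_view pattern) {
    // First suffix whose prefix of length |pattern| is >= pattern.
    auto lo = std::lower_bound(
        sa.begin(), sa.end(), pattern,
        [text](std::size_t suffix, std::string_view p) {
            return text.substr(suffix, p.size()) < p;
        });
    // First suffix whose prefix of length |pattern| is > pattern.
    auto hi = std::upper_bound(
        lo, sa.end(), pattern,
        [text](std::string_view p, std::size_t suffix) {
            return p < text.substr(suffix, p.size());
        });
    // Suffixes in [lo, hi) all start with the pattern; report them
    // in text order rather than lexicographic order.
    std::vector<std::size_t> hits(lo, hi);
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

With the text `ACGTACGTTACG`, looking up the pattern `ACG` in this sketch reports the start positions 0, 4 and 9. Once the index is built, each query costs a binary search over the suffixes instead of a scan of the whole text, which is one reason several tools can share such a component.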

Methods

Results

Conclusion