Suffix tree searcher: exploration of common substrings in large DNA sequence sets.

David Minkley, Chris Upton, Michael J Whitney, Marina G Barsky, Chris Kelly, Song-Han Lin

doi:10.1186/1756-0500-7-466

Open Access

https://doi.org/10.1186/1756-0500-7-466

Abstract

Background: Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms have been implemented with a graphical user interface, preventing their incorporation into a feasible laboratory workflow.

Results: Suffix Tree Searcher (STS) is designed as an easy-to-use tool to index, search, and analyze very large DNA sequence datasets. The program accommodates very large numbers of very large sequences, with aggregate size reaching tens of billions of nucleotides. The program makes use of pre-sorted persistent "building blocks" to reduce the time required to construct new trees. STS is comprised of a graphical user interface written in Java, and four C modules. All components are automatically downloaded when a web link is clicked. The underlying suffix tree data structure permits extremely fast searching for specific nucleotide strings, with wild cards or mismatches allowed. Complete tree traversals for detecting common substrings are also very fast. The graphical user interface allows the user to transition seamlessly between building, traversing, and searching the dataset.

Conclusions: Thus, STS provides a new resource for the detection of substrings common to multiple DNA sequences or within a single sequence, for truly huge data sets. The re-searching of sequence hits, allowing wild card positions or mismatched nucleotides, together with the ability to rapidly retrieve large numbers of sequence hits from the DNA sequence files, provides the user with an efficient method of evaluating the similarity between nucleotide sequences by multiple alignment or use of Logos. The ability to re-use existing suffix tree pieces considerably shortens index generation time. The graphical user interface enables quick mastery of the analysis functions, easy access to the generated data, and seamless workflow integration.

Highlights

Large DNA sequence data sets require special bioinformatics tools to search and compare them.
For example: 1) in long DNA sequences, rearrangements, including transpositions and inversions, can make alignments impossible; 2) gene predictions may miss annotating some genes and promoters; 3) motif searches are too often performed using pre-existing databases of sequence patterns. To supplement these approaches in genome analysis, we have been investigating a different type of query, one that searches for short DNA sequences that are shared among a variety of long DNA sequences without the need for the sequences to be aligned.
It is important to note that the underlying C programs are hidden by the graphical user interface (GUI), so that in normal usage the user does not require any knowledge of their implementation.

Summary

Introduction

Large DNA sequence data sets require special bioinformatics tools to search and compare them. For example: 1) in long DNA sequences, rearrangements, including transpositions and inversions, can make alignments impossible; 2) gene predictions may miss annotating some genes and promoters (due to sequencing errors or poorly annotated reference genomes); 3) motif searches are too often performed using pre-existing databases of sequence patterns (leaving no potential to find novel patterns). To supplement these approaches in genome analysis, we have been investigating a different type of query, one that searches for short DNA sequences that are shared among a variety of long DNA sequences without the need for the sequences to be aligned. After reviewing an approach akin to short-read alignment, in which the “short reads” would be extracted from a long query sequence, we considered a suffix-tree approach because it seemed likely to work better when large numbers of sequences need to be compared.
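To make this type of query concrete, the following is a minimal Python sketch of an alignment-free search for short substrings shared among several sequences, followed by a re-search that tolerates wild cards and mismatches. It is not the STS implementation: it uses a naive in-memory k-mer index rather than a disk-based suffix tree, so it only illustrates the kind of question being asked, at toy scale. The function names, the `min_sequences` and `max_mismatches` parameters, and the use of `N` as the wild card are assumptions made for illustration.

```python
from collections import defaultdict


def shared_substrings(sequences, k, min_sequences=None):
    """Map each length-k substring to the names of the sequences containing it.

    Only substrings found in at least `min_sequences` sequences are returned
    (default: all of them). No alignment of the sequences is needed.
    Hypothetical helper; a naive stand-in for a suffix tree traversal.
    """
    if min_sequences is None:
        min_sequences = len(sequences)
    seen_in = defaultdict(set)
    for name, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            seen_in[seq[i:i + k]].add(name)
    return {
        kmer: sorted(names)
        for kmer, names in seen_in.items()
        if len(names) >= min_sequences
    }


def find_with_mismatches(pattern, seq, max_mismatches=0, wildcard="N"):
    """Return start positions where `pattern` occurs in `seq`.

    A wildcard position in the pattern matches any nucleotide; up to
    `max_mismatches` other positions may differ. Brute-force scan,
    assumed here only for illustration.
    """
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        mismatches = 0
        for p, s in zip(pattern, seq[i:i + len(pattern)]):
            if p != wildcard and p != s:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Inner loop finished without exceeding the mismatch budget.
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Toy sequences invented for this example.
    genomes = {"a": "TTGACGTAAC", "b": "CCGACGTAGG", "c": "GACGTATTTT"}
    print(shared_substrings(genomes, k=6))
    print(find_with_mismatches("GACNTA", genomes["a"]))
    print(find_with_mismatches("GACGTT", genomes["a"], max_mismatches=1))
```

The k-mer index needs memory proportional to the number of distinct substrings and must be rebuilt for each value of `k`. A suffix tree indexes all substring lengths at once, which is one reason it suits data sets of the size described here.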