SW#db: GPU-Accelerated Exact Sequence Similarity Database Search.

Matija Korpar, Dino Blažeka, Mile Šikić, Martin Šošić, Narcis Fernandez-Fuentes

doi:10.1371/journal.pone.0145857

Open Access

https://doi.org/10.1371/journal.pone.0145857

Abstract

In recent years we have witnessed growth in sequencing yield and in the number of samples sequenced and, as a result, growth of publicly maintained sequence databases. This increase in data has placed high demands on protein similarity search algorithms, which face two opposing goals: keeping running times acceptable while maintaining a sufficiently high level of sensitivity. The most time-consuming step of similarity search is the local alignment of the query against database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Because of its quadratic time complexity, aligning a query to the whole database is usually too slow. Therefore, most protein similarity search methods apply heuristics before the exact local alignment to reduce the number of candidate sequences in the database. However, there is still a need to align a query sequence to the reduced database. In this paper we present the SW#db tool and library for fast exact similarity search. Although its running times as a standalone tool are comparable to those of BLAST, it is primarily intended for the exact local alignment phase, in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and, at the time of writing, was 4–5 times faster than SSEARCH, 6–25 times faster than CUDASW++ and more than 20 times faster than SSW, using multiple queries on the Swiss-prot and Uniref90 databases.
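To make the cost of the exact alignment step concrete, the following is a minimal Python sketch of Smith-Waterman scoring and of a brute-force search that aligns one query against every database sequence. It is not SW#db's implementation: the function names `sw_score` and `search`, the match/mismatch scores, and the linear gap penalty are assumptions chosen for illustration, whereas protein search tools typically use a substitution matrix and affine gap penalties. The nested loops visit every cell of the query-by-target matrix, which is the quadratic cost the abstract refers to.

```python
def sw_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (illustrative scoring).

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, so memory is O(len(target)) while time remains
    O(len(query) * len(target)).
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align query[i-1] with target[j-1]
                prev[j] + gap,      # gap in the target
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database, top=3):
    """Score the query against every sequence and return the best hits.

    `database` is a hypothetical dict mapping sequence names to sequences.
    Hits are sorted by descending score, then by name for stable output.
    """
    scored = [(sw_score(query, seq), name) for name, seq in database.items()]
    scored.sort(key=lambda hit: (-hit[0], hit[1]))
    return scored[:top]


if __name__ == "__main__":
    toy_db = {"a": "ACGT", "b": "TTTT", "c": "ACGA"}
    for score, name in search("ACGT", toy_db):
        print(name, score)
```

Each call to `sw_score` is independent of the others, which is why a search over many database sequences lends itself to the GPU and CPU parallelization mentioned above.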

Highlights

Searching for protein homologues has become a daily routine for many biologists
To systematically compare the performance of SW#db with BLASTP, SSW, CUDASW++ and SSEARCH, we used a list of proteins of various lengths (Table 1) and the ASTRAL dataset [15] as queries and Swiss-prot and Uniref90 as databases
We managed to run this version on a configuration with older NVIDIA GTX690 cards, and its running times were similar to those of SW#db for almost all protein lengths, except for lengths above 20000 residues, where it was slightly faster (S2 Fig)

Summary

Introduction

Searching for protein homologues has become a daily routine for many biologists. Popular BLAST tools (PSI/DELTA/BLASTP) [1,2,3] produce search results for a single query in less than a second, and many bioinformatics tools have come to depend on the BLAST tool family to find matches in sequence databases. Protein sequence databases are growing at an unprecedented pace, and we would often like to find homologues of not one, but hundreds, thousands, or more queries. The extensive time cost of such a search can hinder research. The BLAST family of tools, not being naturally parallelisable, is unable to take advantage of new hardware focused on low-level parallelism.

Methods

Results

Conclusion