TOPAZ: asymmetric suffix array neighbourhood search for massive protein databases

Alan Medlar, Liisa Holm

doi:10.1186/s12859-018-2290-3

Abstract

Background: Protein homology search is an important, yet time-consuming, step in everything from protein annotation to metagenomics. Its application, however, has become increasingly challenging, due to the exponential growth of protein databases. In order to perform homology search at the required scale, many methods have been proposed as alternatives to BLAST that make an explicit trade-off between sensitivity and speed. One such method, SANSparallel, uses a parallel implementation of the suffix array neighbourhood search (SANS) technique to achieve high speed and provides several modes to allow for greater sensitivity at the expense of performance.

Results: We present a new approach called asymmetric SANS together with scored seeds and an alternative suffix array ordering scheme called optimal substitution ordering. These techniques dramatically improve both the sensitivity and speed of the SANS approach. Our implementation, TOPAZ, is one of the top performing methods in terms of speed, sensitivity and scalability. In our benchmark, searching UniProtKB for homologous proteins to the Dickeya solani proteome, TOPAZ took less than 3 minutes to achieve a sensitivity of 0.84 compared to BLAST.

Conclusions: Despite the trade-off homology search methods have to make between sensitivity and speed, TOPAZ stands out as one of the most sensitive and highest performance methods currently available.
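The abstract names the suffix array neighbourhood search (SANS) technique without describing it. As a rough illustration of the general idea only, the sketch below sorts all database suffixes, locates where each query suffix would be inserted, and lets the database sequences that own the surrounding suffixes collect votes. This is not the TOPAZ or SANSparallel implementation: the function names, the `window` parameter, the vote-count scoring and the plain lexicographic ordering are all assumptions made for illustration, and asymmetric SANS, scored seeds and optimal substitution ordering are not modelled.

```python
from bisect import bisect_left
from collections import Counter


def build_suffix_array(db):
    """Return sorted (suffix, seq_id) pairs for every suffix in the database.

    Naive construction for illustration; practical tools store integer
    offsets instead of materialised suffix strings.
    """
    entries = []
    for seq_id, seq in enumerate(db):
        for i in range(len(seq)):
            entries.append((seq[i:], seq_id))
    entries.sort()
    return entries


def neighbourhood_search(query, suffix_array, window=2):
    """Rank database sequences by how often one of their suffixes lies within
    `window` positions of the insertion point of a query suffix.

    Vote counting is a hypothetical scoring rule chosen for simplicity.
    """
    keys = [suffix for suffix, _ in suffix_array]
    votes = Counter()
    for i in range(len(query)):
        pos = bisect_left(keys, query[i:])  # where this query suffix would sit
        lo = max(0, pos - window)
        hi = min(len(suffix_array), pos + window)
        for _, seq_id in suffix_array[lo:hi]:
            votes[seq_id] += 1
    return votes.most_common()


# Toy database: sequence 1 differs from sequence 0 by one residue,
# sequence 2 is unrelated.
db = ["MKVLAAGIV", "MKVLSAGIV", "WWWWPPPP"]
hits = neighbourhood_search("MKVLAAGIV", build_suffix_array(db))
```

On this toy input the identical sequence ranks first, the single-substitution variant second and the unrelated sequence last, because suffixes of similar sequences end up close together in sorted order.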

Highlights

Protein homology search is an important, yet time-consuming, step in everything from protein annotation to metagenomics
We compare the performance of TOPAZ with BLAST [6], DIAMOND [16], Lambda [14], LAST [12] and SANSparallel [11]
While there are many other methods for protein homology search, we focused on methods that have demonstrated good performance in previous benchmarks

Summary

Results

We compare the performance of TOPAZ with BLAST (ver. 2.5.0+) [6], DIAMOND (ver. 0.8.37.99) [16], Lambda (ver. 1.9.2) [14], LAST (ver. 801) [12] and SANSparallel (ver. 2.2) [11].

Program options: Where possible, each method was run to output 1000 hits per query sequence with an E-value less than or equal to 1. These parameter values were chosen to emphasise the importance of sensitivity. We also ran all methods with an E-value threshold of 10⁻⁹, outputting 100 and 1000 hits (see Additional file 1). Reducing the maximum number of hits increased the speed of Lambda and SANSparallel, and a more stringent E-value threshold increased the runtime of LAST. To make this a fair test, we ran each method in different modes to trade off speed against sensitivity. Timing measurements were taken by running the program twice and using the measurement from the second run to ensure disk access times were not a factor.
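The benchmark conventions described above can be sketched in a few lines. The following is a minimal illustration, not the authors' evaluation scripts: the function names, the `(subject_id, evalue)` hit representation and the reading of sensitivity as the fraction of reference (for example BLAST) hits that a method recovers are all assumptions.

```python
import time


def filter_hits(hits, max_evalue=1.0, max_hits=1000):
    """Keep hits with E-value <= max_evalue, lowest E-value first, capped at max_hits.

    `hits` is a list of (subject_id, evalue) tuples (hypothetical format).
    """
    kept = sorted((h for h in hits if h[1] <= max_evalue), key=lambda h: h[1])
    return kept[:max_hits]


def sensitivity(method_pairs, reference_pairs):
    """Fraction of reference (query, subject) pairs that the method also reports.

    Assumed definition: the reference set would come from the baseline tool.
    """
    reference = set(reference_pairs)
    if not reference:
        return 0.0
    return len(reference & set(method_pairs)) / len(reference)


def time_second_run(run):
    """Call `run` twice and return the wall-clock duration of the second call,
    so that first-run disk reads do not dominate the measurement."""
    run()  # first run warms the file-system cache; its timing is discarded
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


# Toy usage with made-up identifiers and E-values.
raw = [("sp|A", 1e-30), ("sp|B", 0.5), ("sp|C", 3.0)]
kept = filter_hits(raw, max_evalue=1.0, max_hits=1000)
recall = sensitivity(
    [("q1", "sp|A"), ("q1", "sp|D")],
    [("q1", "sp|A"), ("q1", "sp|B")],
)
```

In practice `run` would wrap a call to the search program, for instance through `subprocess.run`, with the E-value threshold and hit limit passed through each tool's own options.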

Journal: BMC Bioinformatics	Publication Date: Jul 31, 2018
Citations: 4	License type: open-access
