Abstract

Background

Protein homology search is an important, yet time-consuming, step in everything from protein annotation to metagenomics. Its application, however, has become increasingly challenging due to the exponential growth of protein databases. In order to perform homology search at the required scale, many methods have been proposed as alternatives to BLAST that make an explicit trade-off between sensitivity and speed. One such method, SANSparallel, uses a parallel implementation of the suffix array neighbourhood search (SANS) technique to achieve high speed and provides several modes to allow for greater sensitivity at the expense of performance.

Results

We present a new approach called asymmetric SANS together with scored seeds and an alternative suffix array ordering scheme called optimal substitution ordering. These techniques dramatically improve both the sensitivity and speed of the SANS approach. Our implementation, TOPAZ, is one of the top performing methods in terms of speed, sensitivity and scalability. In our benchmark, searching UniProtKB for homologous proteins to the Dickeya solani proteome, TOPAZ took less than 3 minutes to achieve a sensitivity of 0.84 compared to BLAST.

Conclusions

Despite the trade-off homology search methods have to make between sensitivity and speed, TOPAZ stands out as one of the most sensitive and highest performance methods currently available.

Highlights

  • Protein homology search is an important, yet time-consuming, step in everything from protein annotation to metagenomics

  • We compare the performance of TOPAZ with BLAST [6], DIAMOND [16], Lambda [14], LAST [12] and SANSparallel [11]

  • While there are many other methods for protein homology search, we focused on methods that have demonstrated good performance in previous benchmarks


Summary

Results

We compare the performance of TOPAZ with BLAST (ver. 2.5.0+) [6], DIAMOND (ver. 0.8.37.99) [16], Lambda (ver. 1.9.2) [14], LAST (ver. 801) [12] and SANSparallel (ver. 2.2) [11]. To make this a fair test, we ran each method in different modes to trade off speed and sensitivity.

Program options

Where possible, each method was run to output 1000 hits per query sequence with an E-value less than or equal to 1. These parameter values were chosen to emphasise the importance of sensitivity. We also ran all methods with an E-value threshold of 10⁻⁹, outputting 100 and 1000 hits (see Additional file 1). Reducing the maximum number of hits increased the speed of Lambda and SANSparallel, and a more stringent E-value threshold increased the runtime of LAST.

Timing measurements were taken by running the program twice and using the measurement from the second run, to ensure disk access times were not a factor.
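
The timing protocol described here (run the program twice, keep the second measurement) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmarking script used in the study: the helper name `time_runs` is hypothetical, and the command shown is a harmless placeholder rather than a real invocation of any of the search tools.

```python
import subprocess
import sys
import time


def time_runs(cmd, runs=2):
    """Run `cmd` `runs` times and return the wall-clock time of each run.

    Hypothetical helper: the first run warms the operating system's file
    cache, so the last entry reflects a run where disk access is less
    likely to dominate.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output; only the elapsed time matters here.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the actual search
    # command line for the tool being benchmarked.
    cmd = [sys.executable, "-c", "pass"]
    timings = time_runs(cmd, runs=2)
    print(f"first run:  {timings[0]:.3f} s (discarded)")
    print(f"second run: {timings[-1]:.3f} s (reported)")
```

Only the second value would be reported. Wall-clock time is used here because it includes time spent waiting on I/O, which is the effect the warm-up run is meant to remove.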

