Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2

Julie E Hernández-Salmerón, Gabriel Moreno-Hagelsieb

doi:10.1186/s12864-020-07132-6

Open Access

https://doi.org/10.1186/s12864-020-07132-6

Journal: BMC Genomics	Publication Date: Oct 24, 2020
Citations: 49	License type: open-access

Affiliation: Wilfrid Laurier University

Abstract

Background: Finding orthologs remains an important bottleneck in comparative genomics analyses. While the authors of software for the quick comparison of protein sequences evaluate the speed of their software and compare their results against the most usual software for the task, it is not common for them to evaluate their software for more particular uses, such as finding orthologs as reciprocal best hits (RBH). Here we compared RBH results obtained using software that runs faster than blastp, namely lastal, diamond, and MMseqs2.

Results: We found that lastal required the least time to produce results. However, it yielded fewer results than any other program when comparing the proteins encoded by evolutionarily distant genomes. The program producing the most similar number of RBH to blastp was diamond run with the “ultra-sensitive” option. However, this option was diamond’s slowest, with the “very-sensitive” option offering the best balance between speed and RBH results. The speed-up was much more evident when dealing with eukaryotic genomes, which code for more numerous proteins. For example, lastal took a median of approx. 1.5% of the blastp time to run with bacterial proteomes and 0.6% with eukaryotic ones, while diamond with the very-sensitive option took 7.4% and 5.2%, respectively. Though estimated error rates were very similar among the RBH obtained with all programs, RBH obtained with MMseqs2 had the lowest error rates among the programs tested.

Conclusions: The fast algorithms for pairwise protein comparison produced results very similar to blast in a fraction of the time, with diamond offering the best compromise in speed, sensitivity and quality, as long as a sensitivity option other than the default was chosen.

Highlights

Finding orthologs remains an important bottleneck in comparative genomics analyses
The computing speeds for finding homologs were plotted for each program relative to blastp
Of all the programs tested, lastal was the fastest (Fig. 1), obtaining results in a median of approximately 1.5% of the blastp time to run with bacterial proteomes (Fig. 1, left) and 0.6% with eukaryotic ones (Fig. 1, right)

Summary

Introduction

Finding orthologs remains an important bottleneck in comparative genomics analyses. While the authors of software for the quick comparison of protein sequences evaluate the speed of their software and compare their results against the most usual software for the task, it is not common for them to evaluate their software for more particular uses, such as finding orthologs as reciprocal best hits (RBH). Orthologs are defined as characters that diverge after a speciation event [1]. This normally means that, if the characters are genes, they can be thought of as the same genes in different species. Efforts to standardize methods for the inference of orthology remain under constant evaluation, with over forty web services available to the community [6, 7]. Few of these methods are based on phylogenetic analyses (tree-based approach), which, despite being expected to be the most accurate, tend to be computationally intensive and impractical for big databases [8, 9]. Some methods employ pairwise sequence similarity comparisons (graph-based methods) that have been successfully implemented, such as the clusters of orthologous groups (COG) database
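
To make the reciprocal best hit criterion mentioned above concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes the hits from each search direction have already been parsed into (query, target, bit score) tuples; the function names, the tuple layout and the toy identifiers are hypothetical, and it ignores tied scores as well as any e-value or coverage thresholds a real analysis might apply.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (query_id, target_id, bit_score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring target.

    Assumption: ties are broken by keeping the first hit seen.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Toy data (made up for illustration only).
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0),
          ("a2", "b2", 90.0), ("a3", "b1", 80.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0)]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(sorted(rbh))
```

In this toy example only a1 and b1 pick each other as best hits. The best hit of a2 is b2, but the best hit of b2 is a1, so that pair is not reciprocal and is excluded.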

Methods

Results

Discussion

Conclusion