OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Victor Rossier, Alex Warwick Vesztrocy, Christophe Dessimoz, Marc Robinson-Rechavi, Inanc Birol

doi:10.1093/bioinformatics/btab219

Open Access

Abstract

Motivation: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.

Results: Here, we first show that in multiple animal and plant datasets, 18–62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.

Availability and implementation: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Assigning new sequences to known protein families is a prerequisite for many comparative and evolutionary analyses (Glover et al., 2019).
Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
We show that by adopting the efficient alignment-free, k-mer-based analyses pioneered by metagenomic taxonomic classifiers such as Kraken or RAPPAS (Linard et al., 2019; Wood and Salzberg, 2014), and adapting them to protein subfamily-level classification, OMAmer is computationally faster and more scalable than DIAMOND.

Summary

Introduction

Assigning new sequences to known protein families is a prerequisite for many comparative and evolutionary analyses (Glover et al., 2019). When gene duplication events have resulted in multiple copies per species, multiple ‘subfamilies’ are generated, which can make placing a protein sequence into the correct subfamily challenging. Gene subfamilies are nested gene families defined after duplication events and organized hierarchically into gene trees. For example, the epsilon and gamma hemoglobin subfamilies are defined at the placental level, and are nested in the adult hemoglobin beta subfamily defined at the mammal level (Opazo et al., 2008). Both belong to the globin family, which originated in the last universal common ancestor of cellular life (LUCA).
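
The nesting of subfamilies described above can be pictured as a small tree. The following Python sketch is purely illustrative: the `PARENT` mapping, the node labels and the helper functions `lineage`, `is_nested_in` and `deepest_shared` are assumptions made for this example, and are not part of OMAmer or of any database schema. It encodes only the hemoglobin example from the text, with intermediate levels between hemoglobin beta and the globin family omitted.

```python
# Hypothetical labels: each subfamily maps to the (sub)family it is nested in.
# Levels between hemoglobin beta and the globin root are omitted for brevity.
PARENT = {
    "globin (LUCA)": None,
    "hemoglobin beta (mammals)": "globin (LUCA)",
    "hemoglobin epsilon (placentals)": "hemoglobin beta (mammals)",
    "hemoglobin gamma (placentals)": "hemoglobin beta (mammals)",
}


def lineage(node):
    """Return the path from a subfamily up to its root family, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def is_nested_in(inner, outer):
    """True if `outer` is `inner` itself or one of the groups containing it."""
    return outer in lineage(inner)


def deepest_shared(a, b):
    """Return the most specific group that contains both `a` and `b`."""
    ancestors_of_b = set(lineage(b))
    for node in lineage(a):  # walks from most specific to least specific
        if node in ancestors_of_b:
            return node
    return None


if __name__ == "__main__":
    eps = "hemoglobin epsilon (placentals)"
    gamma = "hemoglobin gamma (placentals)"
    print(" -> ".join(lineage(eps)))
    print(deepest_shared(eps, gamma))
```

In this toy hierarchy, a sequence that can only be placed with confidence in the group shared by epsilon and gamma would be assigned to hemoglobin beta at the mammal level; assigning it to either placental subfamily instead would be an over-specific placement.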

Methods

Results

Conclusion