Calculating orthologs in bacteria and Archaea: a divide and conquer approach.

Mihail R Halachev, Nicholas J Loman, Mark J Pallen, Jonathan H Badger

doi:10.1371/journal.pone.0028388

Abstract

Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching, which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a “divide and conquer” approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at species level, xBASE-Orth constructs pan-genomes and uses them as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree, reusing the previously computed data. This significantly reduces the number of alignments that need to be performed, which translates into faster computation and makes ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of the bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations.
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/.

Highlights

A central goal in comparative genomics is to identify novel and/or shared biology between organisms, or at least make informed predictions in this regard
Using the ortholog pairs data, we organized the coding sequences (CDSs) in ortholog groups (OGs) using the single-linkage approach
Each of the generated 15,874 (Archaea) and 159,657 (Bacteria) ortholog groups contains 2 or more CDSs and each CDS belongs to one group only
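
Single-linkage grouping of ortholog pairs can be illustrated with a minimal sketch. It assumes the pairs are available as simple (CDS, CDS) tuples and uses a union-find structure, which is one common way to compute single-linkage groups; the function name, the identifiers and the toy data are hypothetical and are not taken from xBASE-Orth.

```python
from collections import defaultdict


def single_linkage_groups(pairs):
    """Group CDS identifiers so that any two CDSs connected by a chain
    of ortholog pairs end up in the same group (single linkage)."""
    parent = {}

    def find(x):
        # Register unseen items as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    # Sort members and groups for a deterministic result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical ortholog pairs between CDSs of three genomes (A, B, C).
toy_pairs = [("A1", "B1"), ("B1", "C1"), ("A2", "B2")]
toy_groups = single_linkage_groups(toy_pairs)
print(toy_groups)
```

Because only CDSs that occur in at least one pair are registered, every resulting group has two or more members, and the union-find structure places each CDS in exactly one group, matching the two properties stated above. In the toy data, A1 and C1 share a group even though they are not directly paired, which is the defining behaviour of single linkage.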

Summary

Introduction

A central goal in comparative genomics is to identify novel and/or shared biology between organisms, or at least make informed predictions in this regard. Fitch [1] originally proposed the definition of orthologs as homologous proteins related via speciation. Under this definition of orthologs “it is both theoretically plausible and empirically supported that due to their sequence similarity they have similar structure and typically perform equivalent biological function” [2]. As lines of descent are rarely known, a practical approach for inferring orthology is to compare protein sequences and draw conclusions based on sequence similarity. The existence of co-orthologs, i.e. where a pair of paralogs from one genome is orthologous to a protein or a pair of paralogs from another, can complicate such approaches and requires further consideration.

Objectives

Methods

Results

Conclusion