Abstract

Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA.

Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases.

Results: CS-BLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization.

Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CS-BLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity.

Availability and Implementation: Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark).

Contact: forslund@embl.de

Supplementary information: Supplementary data are available at Bioinformatics online.
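The benchmark idea described under Method (label protein pairs from their domain architectures, score them with a search tool, then compute standard metrics) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the toy annotations, the scores, the labelling rule in `label_pair` and all function names are hypothetical assumptions chosen for illustration. In particular, the sketch assumes that identical architectures count as homologous, that architectures sharing no family count as non-homologous, and that partially overlapping pairs are excluded.

```python
from itertools import combinations

# Hypothetical toy annotations: protein ID -> ordered tuple of domain families.
# A real benchmark would derive these from PFAM, SCOP/Superfamily or CATH/Gene3D.
ARCHITECTURES = {
    "P1": ("famA", "famB"),
    "P2": ("famA", "famB"),
    "P3": ("famC",),
    "P4": ("famC",),
    "P5": ("famD", "famE"),
    "P6": ("famA",),
}


def label_pair(arch_a, arch_b):
    """Return True (homologous), False (non-homologous) or None (ambiguous).

    Simplifying assumption, not the published rule: identical architectures
    are homologous, disjoint architectures are non-homologous, and partial
    overlaps are left out of the benchmark.
    """
    if arch_a == arch_b:
        return True
    if not set(arch_a) & set(arch_b):
        return False
    return None


def build_benchmark(architectures):
    """Label every unordered protein pair, skipping ambiguous ones."""
    pairs = {}
    for a, b in combinations(sorted(architectures), 2):
        label = label_pair(architectures[a], architectures[b])
        if label is not None:
            pairs[(a, b)] = label
    return pairs


def sensitivity_specificity(labels, scores, threshold):
    """Treat pairs scoring >= threshold as predicted homologs.

    Pairs absent from `scores` are treated as having no hit (score 0.0).
    """
    tp = fp = tn = fn = 0
    for pair, is_homolog in labels.items():
        predicted = scores.get(pair, 0.0) >= threshold
        if is_homolog and predicted:
            tp += 1
        elif is_homolog:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


BENCHMARK = build_benchmark(ARCHITECTURES)

# Hypothetical alignment scores as a search tool might report them.
SCORES = {("P1", "P2"): 250.0, ("P3", "P4"): 35.0, ("P1", "P5"): 60.0}

SENS, SPEC = sensitivity_specificity(BENCHMARK, SCORES, threshold=50.0)
print(f"sensitivity={SENS:.2f} specificity={SPEC:.2f}")
```

Sweeping `threshold` over the observed score range and recording each (sensitivity, specificity) point would give a ROC-style curve, one common way to compare tools on the same labelled pairs.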

Highlights

  • Modern molecular biology relies on evolutionary conservation of properties between entities such as genes and proteins that are homologous, i.e. share descent from a common ancestor

  • Given the role of homology inference in genome-scale biology, validation and comparative benchmarking of the tools in use is important, even where it is difficult in both theory and practice to construct such

  • The three databases generally agree on homology/non-homology of protein pairs under our domain-based definition


Summary

Introduction

Modern molecular biology relies on evolutionary conservation of properties between entities such as genes and proteins that are homologous, i.e. share descent from a common ancestor. Through homology relationships (and within them, orthology relationships, where common ancestry dates back to a species diversification rather than a gene duplication), insights into the molecular function of whole sequences (Bork et al, 1998) or specific sites (Yao et al, 2003), 3D structure (Chothia and Lesk, 1986; Todd et al, 2001) or context such as regulation can be transferred. Such transfer of results from direct experimentation to the components of the vast number of genomes for which only molecular data are available, courtesy of ‘next-generation’ nucleotide sequencing techniques, means that homology inference forms a mainstay of bioinformatics research as well as of its applications in organismal, clinical and evolutionary biology. Methods for sequence alignment, homology search and homology scoring quickly became overwhelmed by computational complexity as database sizes increased, prompting the development of heuristic tools such as FASTA (Pearson and Lipman, 1988) and NCBI-BLAST (Altschul et al, 1990), which run fast enough to screen the whole of the known sequence universe for similarity to a novel uncharacterized query.

