Abstract

Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity.

Results: In this study, we validate GenFamClust by comparing it to well known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms on homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs.

Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.

Electronic supplementary material: The online version of this article (doi:10.1186/s12862-016-0684-2) contains supplementary material, which is available to authorized users.

Highlights

  • Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology

  • We compared clusters obtained by applying clustering algorithms to homologs inferred by GenFamClust (GFC) with the semi-manually curated pillars (sets of orthologs and ohnologs) of the Yeast Gene Order Browser (YGOB) [38], using complete genomes from a fungal dataset

  • GFC consistently performed better than Neighborhood Correlation (NC) for each clustering algorithm on all datasets with varying synteny and similarity conservation, indicating that synteny can improve inference. This is consistent with other studies [10, 34, 35], which show that gene order conservation provides extra information that can complement gene sequence conservation in inferring orthologs more accurately

Summary

Introduction

Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. […] using local [7] or global [8] similarity depends on a model that does not take the insertion of domains into account. This was recognized by Song et al. [9], who suggested a definition for “multidomain homology”. They stated that homologous proteins follow vertical inheritance and that inserted domains (which are seen as horizontally transferred from another protein) should be discounted. Identifying vertical inheritance is difficult, but they proposed a proxy based on a statistical analysis of conserved domain architecture.

Ali et al., BMC Evolutionary Biology (2016) 16:120
