Abstract

Background

Clustering sequences into families has long been an important step in the characterization of genes and proteins. Many algorithms have been developed for this purpose, most of which are based either on direct similarity between gene pairs or on some sort of network structure, where the weights on the edges of the constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology, and it has not been utilized to its fullest potential.

Results

Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data.

Conclusions

The results obtained from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. The accuracy as well as the definition of synteny scores is the most valuable contribution of GenFamClust.

Highlights

  • We used default parameter settings for all other options and, without loss of generality, turned off the parameters related to Gene Inversion, Lateral Gene Transfer (LGT), Fission, Fusion and Pseudogenization events

  • Methodologies based only on similarity have long been proposed for homology inference, without taking synteny into account

Summary

Introduction

Clustering sequences into families has long been an important step in the characterization of genes and proteins. Gene family classification is an important prerequisite in bioinformatics studies and enables, e.g., phylogenetic and structural analysis. Because of this importance, it has become one of the most active fields of research in bioinformatics, and bioinformaticians have employed different algorithms to detect homology and to partition detected homologs into gene families. The pioneering homology inference algorithms, such as Reciprocal Bidirectional Hits (RBH) [4] and Clusters of Orthologous Groups (COGs) [5], are similarity-based methods that typically employ BLAST [2,3] as a subroutine. Another class of algorithms uses sequence clustering techniques and examines a wide range of BLAST hits. A later generation of homology inference algorithms improved accuracy and time and/or memory requirements; it includes algorithms such as Neighborhood Correlation [16], HiFiX [17], PHYRN [18] and COCO-CL [19], which infer homologs by extracting evidence from the network structure of BLAST hits or multiple sequence alignments.
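
To make the similarity-based approach mentioned above more concrete, the following is a minimal Python sketch of reciprocal best hits between two genomes. It assumes that the BLAST results have already been parsed into (query, subject, score) tuples. The function names, gene identifiers and scores are hypothetical and chosen for illustration only; they are not taken from GenFamClust or from any of the cited tools.

```python
def best_hits(hits):
    """Return a dict mapping each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples, e.g. parsed
    from tabular BLAST output (assumed format, hypothetical data).
    On tied scores, the hit seen first is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(hits_a_vs_b)
    best_ba = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example with made-up gene identifiers and bit scores.
hits_a_vs_b = [
    ("a1", "b1", 250.0),
    ("a1", "b2", 90.0),
    ("a2", "b2", 180.0),
    ("a3", "b1", 120.0),
]
hits_b_vs_a = [
    ("b1", "a1", 245.0),
    ("b1", "a3", 110.0),
    ("b2", "a2", 175.0),
    ("b2", "a1", 95.0),
]

pairs = reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a)
print(pairs)
```

In this toy input, gene a3 has b1 as its best hit, but b1's best hit is a1, so a3 is left unpaired. This shows how a purely pairwise similarity criterion can miss candidate family members, which is the kind of gap that network-based and synteny-aware methods aim to address.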
