Abstract

BackgroundIdentification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data.ResultsTo address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa.ConclusionsWe clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.

Highlights

  • Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics

  • We represent a pan-genome by a hierarchy of genome, annotation and proteome layers stored in a Neo4j graph database to connect different types of data (Fig. 1)

  • We tested scalability on 5 datasets of increasing size compiled from 93 Saccharomyces cerevisiae strains [20] and 5 datasets compiled from 19 Arabidopsis thaliana accessions [21]

Read more

Summary

Results

We present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. Scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa

Conclusions
Background
Methods
Results and discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call