Abstract
Background: Pan-genome approaches afford the discovery of homology relations in a set of genomes by determining how gene families are distributed among them. Retrieving the complete gene distribution across a class of genomes is an NP-hard problem, and computational costs grow with the number of analyzed genomes because all-against-all gene comparisons are required to solve the problem completely. In the presence of phylogenetically distant genomes, the variability introduced by gene duplication and transmission makes recognizing homologous genes even more difficult. A challenge in this field is designing fast and adaptive similarity measures that yield a suitable pan-genome structure of homology relations.

Results: We present PanDelos, a standalone tool for the discovery of pan-genome content among phylogenetically distant genomes. The methodology is based on information theory and network analysis. It is parameter-free because thresholds are automatically deduced from the context. PanDelos avoids sequence alignment by introducing a measure based on k-mer multiplicity. The k-mer length is defined according to general arguments rather than empirical considerations. Candidate homology relations are integrated into a global network, and groups of homologous genes are extracted by applying a community detection algorithm.

Conclusions: PanDelos outperforms existing approaches, Roary and EDGAR, in terms of running time and quality of content discovery. Tests were run on collections of real genomes previously used in analogous studies, and on synthetic benchmarks that provide a fully trusted ground truth. The software is available at https://github.com/GiugnoLab/PanDelos.
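As an illustration of what an alignment-free measure based on k-mer multiplicity can look like, the sketch below computes a generalized Jaccard index over k-mer counts. This is an assumption made for illustration, not necessarily the exact measure implemented in PanDelos; the function names are hypothetical, and k is passed in by hand rather than derived from the data as the tool does.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (its k-mer multiplicities)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def generalized_jaccard(a: str, b: str, k: int) -> float:
    """Illustrative multiplicity-based similarity (hypothetical helper).

    Sums the minimum count of each k-mer over the two sequences and
    divides by the sum of the maximum counts. Returns 0.0 when neither
    sequence contains a k-mer of length k.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    keys = ca.keys() | cb.keys()
    shared = sum(min(ca[x], cb[x]) for x in keys)
    total = sum(max(ca[x], cb[x]) for x in keys)
    return shared / total if total else 0.0
```

Because the counts, not just the presence of k-mers, enter the ratio, a repeated motif in one sequence lowers the score unless the other sequence repeats it as well.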
Highlights
Pan-genome approaches afford the discovery of homology relations in a set of genomes, by determining how some gene families are distributed among a given set of genomes
PanDelos has been compared to Roary and EDGAR
It runs under Linux systems and takes genomic data in GFF format as input
Summary
Pan-genome approaches afford the discovery of homology relations in a set of genomes by determining how gene families are distributed among them. Homologous genes are distinguished as paralogous, when homology occurs within the same genome, or orthologous, when homology occurs between different genomes. We call pan-genome content discovery the determination of homologous groups within a collection of genomes. Different mechanisms are involved in gene transmission. Orthology is associated with “vertical” transmission, which happens among genomes in the same lineage and involves most of the genetic content. “Horizontal” transmission occurs between genomes of organisms of different lineages and involves one or a few genes. Genes present in every genome are the core genes of the pan-genome, and they may be involved in essential life functions.
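The definitions above can be sketched in a few lines of Python. The data layout is an assumption made for illustration: each homologous group is a list of (genome, gene) pairs, and the function names are hypothetical rather than part of PanDelos.

```python
from itertools import combinations


def classify_pairs(family):
    """Split the gene pairs of one homologous group into paralogs and orthologs.

    family is a list of (genome_id, gene_id) tuples (assumed layout).
    """
    paralogs, orthologs = [], []
    for (g1, n1), (g2, n2) in combinations(family, 2):
        # Same genome: paralogous pair; different genomes: orthologous pair.
        (paralogs if g1 == g2 else orthologs).append((n1, n2))
    return paralogs, orthologs


def core_families(families, genomes):
    """Return the groups that have at least one gene in every genome."""
    all_genomes = set(genomes)
    return [f for f in families if all_genomes <= {g for g, _ in f}]


# Toy example with two genomes, A and B.
families = [
    [("A", "a1"), ("A", "a2"), ("B", "b1")],  # present in both genomes
    [("A", "a3")],                            # found in genome A only
]
paralogs, orthologs = classify_pairs(families[0])
core = core_families(families, ["A", "B"])
```

In this toy input, a1 and a2 form a paralogous pair, each of them is orthologous to b1, and only the first group counts as core.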