Abstract
Background: Cataloguing the distribution of genes within natural bacterial populations is essential for understanding evolutionary processes and the genetic basis of adaptation. Advances in whole genome sequencing technologies have led to a vast expansion in the number of bacterial genomes deposited in public databases. There is a pressing need for software solutions that can cluster, catalogue and characterise genes, or other features, in increasingly large genomic datasets.

Results: Here we present a pangenomics toolbox, PIRATE (Pangenome Iterative Refinement and Threshold Evaluation), which identifies and classifies orthologous gene families in bacterial pangenomes over a wide range of sequence similarity thresholds. PIRATE builds upon recent scalable software developments to allow for the rapid interrogation of thousands of isolates. PIRATE clusters genes (or other annotated features) over a wide range of amino acid or nucleotide identity thresholds and uses the clustering information to rapidly identify paralogous gene families and putative fission/fusion events. Furthermore, PIRATE orders the pangenome using a directed graph, provides a measure of allelic variation, and estimates sequence divergence for each gene family.

Conclusions: We demonstrate that PIRATE scales linearly with both number of samples and computational resources, allowing for analysis of large genomic datasets, and compares favorably to other popular tools. PIRATE provides a robust framework for analysing bacterial pangenomes, from largely clonal to panmictic species.
Highlights
For most bacteria the complement of genes for a given species is far greater than the number of genes in any one strain
Differences in methodology lie primarily in the post-processing of clusters: Roary uses a single percentage identity threshold for MCL clustering and separates paralogs based upon their neighboring genes, whereas PanX splits paralogous genes using an alignment/tree-based method rather than the CDHIT-BLAST approach used by Pangenome Iterative Refinement And Threshold Evaluation (PIRATE)
We present PIRATE, a toolbox for pangenomic analysis of bacterial genomes, which provides a framework for exploring gene diversity by defining genes using relaxed sequence similarity thresholds
Summary
For most bacteria the complement of genes for a given species is far greater than the number of genes in any one strain. Current approaches define genes on the basis of strict sequence identity thresholds [2,3,7,8], e-value cutoffs [5,6] or bit score ratios [4]. It is difficult, however, to define a single identity threshold beyond which genes cease to belong to the same family, so a strict cutoff risks splitting divergent alleles of one gene into separate clusters (over-splitting). Over-splitting is likely to be especially problematic in vertically acquired core genes that have undergone strong diversifying selection, or in horizontally acquired accessory genes from multiple source populations which share a distant common ancestor. This can lead to misleading impressions of pangenome size and composition.
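
The effect of the chosen identity threshold on the apparent number of gene families can be illustrated with a small toy example. The sketch below is not PIRATE's implementation (which the abstract describes only at a high level): it assumes pre-aligned, equal-length sequences, uses a simple positional percent identity, and groups sequences by single-linkage clustering. The sequence names, the sequences themselves and the helper functions `percent_identity` and `cluster_at_threshold` are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_at_threshold(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Single-linkage clustering: join any pair with identity >= threshold."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


# Hypothetical toy data: three divergent alleles of one gene plus an unrelated gene.
TOY_FAMILY = {
    "geneA_allele1": "MKTAYIAKQR",
    "geneA_allele2": "MKTAYIAKQL",  # 90% identical to allele1
    "geneA_allele3": "MKSAYLAKQL",  # 80% identical to allele2, 70% to allele1
    "geneB_allele1": "GWPLHDNCEF",  # unrelated sequence
}

# Sweep from strict to relaxed thresholds and report the clusters at each step.
for threshold in (95, 85, 75, 50):
    clusters = cluster_at_threshold(TOY_FAMILY, threshold)
    print(threshold, len(clusters), [sorted(c) for c in clusters])
```

With these toy sequences, a 95% threshold reports four separate clusters, 85% reports three, and 75% or below reports two, with the three alleles of the hypothetical gene A merged into one family and the unrelated gene kept apart. Comparing cluster membership across the sweep, rather than trusting any single cutoff, is the general idea behind evaluating a range of thresholds.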