Abstract

Population-level comparisons of prokaryotic genomes must take into account the substantial differences in gene content resulting from horizontal gene transfer, gene duplication and gene loss. However, the automated annotation of prokaryotic genomes is imperfect, and errors due to fragmented assemblies, contamination, diverse gene families and mis-assemblies accumulate over the population, leading to profound consequences when analysing the set of all genes found in a species. Here, we introduce Panaroo, a graph-based pangenome clustering tool that is able to account for many of the sources of error introduced during the annotation of prokaryotic genome assemblies. Panaroo is available at https://github.com/gtonkinhill/panaroo.

Highlights

  • Prokaryotic genome evolution is driven both by the transfer of genetic material vertically from parent to offspring and by horizontal gene transfer between organisms [1]

  • We demonstrate the success of the algorithm through extensive simulation using the Infinitely Many Genes model [22] and by analysing a diverse array of large bacterial genomic datasets including the major clades of the Global Pneumococcal Sequencing (GPS) project [23]

  • Panaroo builds a full graphical representation of the pangenome, where nodes are clusters of orthologous genes (COGs) and two nodes are connected by an edge if they are adjacent on a contig in any sample from the population
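
The graph described in the last highlight can be illustrated with a minimal Python sketch. It is not Panaroo's implementation: it assumes every gene has already been assigned to a cluster, represents each contig as an ordered list of cluster IDs, and skips the error-correction steps entirely. The function name `build_pangenome_graph`, the isolate names and the toy gene order are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_pangenome_graph(samples):
    """Build a toy pangenome graph.

    samples: dict mapping a sample name to a list of contigs, where each
    contig is an ordered list of gene-cluster IDs (an assumed input format).
    Returns (nodes, edges): nodes maps each cluster to the set of samples
    containing it; edges maps an unordered cluster pair to the set of
    samples in which the two clusters are adjacent on a contig.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for sample, contigs in samples.items():
        for contig in contigs:
            for cluster in contig:
                nodes[cluster].add(sample)
            # Consecutive genes on the same contig define an adjacency.
            for a, b in zip(contig, contig[1:]):
                if a != b:  # simplification: ignore self-adjacency
                    edges[frozenset((a, b))].add(sample)
    return dict(nodes), dict(edges)


# Hypothetical toy input: two isolates, one with a fragmented assembly.
toy = {
    "isolate_1": [["dnaA", "dnaN", "recF"], ["gyrB"]],
    "isolate_2": [["dnaA", "dnaN", "gyrB"]],
}
nodes, edges = build_pangenome_graph(toy)
```

An edge is added as soon as a single sample supports the adjacency, which matches the "in any sample" wording above. Recording the supporting samples on each node and edge is a design choice of this sketch; it makes it easy to see which parts of the graph rest on only one assembly.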


Summary

Introduction

Prokaryotic genome evolution is driven both by the transfer of genetic material vertically from parent to offspring and by horizontal gene transfer between organisms [1]. Large population sequencing studies of bacteria have confirmed that this results in large-scale differences in intraspecies genome content [2]. This has led to the description of the pangenome, the set of all genes that have been found in a species as a whole [3]. A common problem when inferring the pangenome of bacterial genomic datasets is the classification of homologous genes, usually defined by a percentage of shared sequence identity, into either orthologous or paralogous clusters. Orthologs trace their most recent common ancestor to a speciation event whereas paralogs trace their most recent common

