Abstract

Assembly of bacterial short-read whole-genome sequencing data frequently results in hundreds of contigs whose origin, plasmid or chromosome, is unclear. Complete genomes resolved by long-read sequencing can be used to generate and label short-read contigs. We used such labelled contigs to train several popular machine learning methods to classify the origin of contigs from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. We selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium=0.92, F1-score K. pneumoniae=0.90, F1-score E. coli=0.76); these models outperformed existing plasmid prediction tools on a benchmarking set of isolates. We demonstrated the scalability of our models by accurately predicting the plasmidome of a large collection of 1644 E. faecium isolates, and illustrated their applicability by predicting the location of antibiotic-resistance genes in all three species. The SVM classifiers are publicly available as an R package and graphical user interface called ‘mlplasmids’. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.

Highlights

  • Plasmids are autonomous extra-chromosomal elements that can act as major drivers of variation and adaptation in bacterial populations [1, 2]

  • To ensure that our training and test sets contained chromosome- and plasmid-derived contigs from a diverse set of isolates belonging to each species, we estimated the diversity present in our collection of K. pneumoniae, E. coli and E. faecium genomes with Mash [24]

  • We investigated the applicability of genomic signatures to distinguish between plasmid- and chromosome-derived sequences by calculating the pentamer frequencies from complete chromosomal and plasmid sequences of E. faecium, K. pneumoniae and E. coli available in the National Center for Biotechnology Information (NCBI) database
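The pentamer-frequency signature mentioned in the highlights can be illustrated with a minimal sketch. This is not the mlplasmids implementation: the function name `pentamer_frequencies`, the choice to skip windows containing ambiguous bases, and the decision to count a single strand only are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

K = 5              # pentamers, as described in the text
ALPHABET = "ACGT"  # assumption: only unambiguous bases are counted


def pentamer_frequencies(sequence: str, k: int = K) -> dict[str, float]:
    """Return the relative frequency of every k-mer over ACGT in `sequence`.

    Hypothetical helper: windows containing any other character (e.g. N)
    are skipped, and only the given strand is counted.
    """
    seq = sequence.upper()
    valid = set(ALPHABET)
    counts = Counter()
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    total = sum(counts.values())
    # Emit all 4**k possible k-mers so every contig yields a vector of equal length.
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Example: a toy 10 bp contig gives a 1024-dimensional frequency vector.
freqs = pentamer_frequencies("ACGTACGTAC")
```

Each contig is thereby mapped to a fixed-length vector of 4^5 = 1024 frequencies, which is the kind of feature table a classifier such as an SVM can be trained on.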


Summary

Introduction

Plasmids are autonomous extra-chromosomal elements that can act as major drivers of variation and adaptation in bacterial populations [1, 2]. Illumina sequencing platforms, which provide short reads (ranging from 150 to 300 bp) with low error rates, have been widely used to assemble bacterial draft genomes [9]. The frequent presence of insertion sequences (IS) and transposable elements in bacterial genomes prevents their full assembly, because these regions cannot be spanned by short reads [7, 10]. The result is a fragmented assembly, typically consisting of hundreds of chromosomal and plasmid contigs, which makes it difficult to infer the origin of each contig.

