Abstract

Many bacteria contain plasmids, but distinguishing contigs that originate from a plasmid from those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained on only a fraction of the known plasmids, and can be difficult to use in practice. We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers, each trained for a different sequence length on the most current set of known plasmid sequences. We tested PlasClass sequence classification on held-out data and simulations, as well as on publicly available bacterial isolates, plasmidome samples, and plasmids assembled from metagenomic samples. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, allowing it to achieve higher F1 scores in classifying sequences from a wide range of datasets. PlasClass also uses significantly less time and memory. PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available under the MIT license from: https://github.com/Shamir-Lab/PlasClass.

Highlights

  • PlasClass improved precision at the cost of slightly lower recall and had a better overall F1 score on the shorter sequence lengths. These short sequences can make up the majority of contigs in metagenomic assemblies, allowing PlasClass to outperform PlasFlow in many settings

  • We presented the PlasClass algorithm for classifying plasmid sequences

  • We applied the algorithm across a wide range of contexts and showed that in most cases PlasClass outperformed the state-of-the-art algorithm PlasFlow
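
The highlights above compare tools by precision, recall and F1. As a reminder of how these metrics relate, here is a minimal Python sketch. The function name and all counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical classifier A: fewer false positives, slightly more missed plasmids.
precision_a, recall_a, f1_a = precision_recall_f1(tp=80, fp=10, fn=20)

# Hypothetical classifier B: finds more plasmids but with many false positives.
precision_b, recall_b, f1_b = precision_recall_f1(tp=90, fp=40, fn=10)

print(f"A: precision={precision_a:.3f} recall={recall_a:.3f} F1={f1_a:.3f}")
print(f"B: precision={precision_b:.3f} recall={recall_b:.3f} F1={f1_b:.3f}")
```

With these made-up counts, classifier A has lower recall than B but higher precision, and its F1 is higher. F1 is the harmonic mean of precision and recall, so a large gain in precision can outweigh a small loss in recall.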


Summary

Introduction

PlasClass uses a set of logistic regression classifiers, each trained on sequences of a different length sampled from plasmid and bacterial genome reference sequences. We tested PlasClass on simulated data, on bacterial isolates, on a wastewater plasmidome, and on plasmids assembled from human gut microbiome samples. For shorter sequences, which are the majority of contigs in an assembly, PlasClass achieved better F1 scores than PlasFlow.
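
The idea of one logistic regression classifier per sequence length can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the PlasClass implementation. The k-mer frequency features, the k-mer size, the gradient-descent training loop, the rule of choosing the classifier whose training length is closest to the query length, and every function name are assumptions made for this example. The training data are random sequences that differ only in GC content, an artificial signal standing in for real plasmid and chromosome references.

```python
import math
import random
from itertools import product

K = 3  # assumed k-mer size, chosen only to keep the toy example small
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    counts = [0.0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip k-mers containing N or other symbols
            counts[idx] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, epochs=300, lr=2.0):
    """Fit logistic regression weights and bias with batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def random_sequence(length, gc, rng):
    """Toy data: a random sequence with the given GC content."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def train_per_length(lengths, n_per_class, rng):
    """Train one classifier per sequence length (label 1 = toy 'plasmid')."""
    models = {}
    for length in lengths:
        X, y = [], []
        for label, gc in ((1, 0.7), (0, 0.3)):  # artificial class signal
            for _ in range(n_per_class):
                X.append(kmer_frequencies(random_sequence(length, gc, rng)))
                y.append(label)
        models[length] = train_logistic(X, y)
    return models


def closest_length(lengths, n):
    """Assumed selection rule: use the classifier trained on the nearest length."""
    return min(lengths, key=lambda length: abs(length - n))


def classify(models, seq):
    """Score a sequence with the classifier matching its length."""
    w, b = models[closest_length(models, len(seq))]
    x = kmer_frequencies(seq)
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


rng = random.Random(0)
models = train_per_length([500, 1000], n_per_class=30, rng=rng)
score_gc_rich = classify(models, random_sequence(600, 0.7, rng))
score_at_rich = classify(models, random_sequence(900, 0.3, rng))
print(f"toy plasmid-like score: {score_gc_rich:.2f}")
print(f"toy chromosome-like score: {score_at_rich:.2f}")
```

In this sketch a 600 bp query is scored by the model trained on 500 bp sequences and a 900 bp query by the model trained on 1000 bp sequences. Training separately per length lets each classifier see feature vectors with the noise level typical of that length, which is one plausible reason a length-specific design helps on short contigs.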

