Abstract

The biology of bacterial cells is, in general, based on information encoded on circular chromosomes. Regulation of chromosome replication is an essential process that mostly takes place at the origin of replication (oriC), a locus unique per chromosome. Identifying large numbers of oriC sequences is a prerequisite for systematic studies that could lead to insights into oriC functioning as well as the identification of novel drug targets for antibiotic development. Current methods for identifying oriC sequences rely on chromosome-wide nucleotide disparities and are therefore limited to fully sequenced genomes, leaving a large number of genomic fragments unstudied. Here, we present gammaBOriS (Gammaproteobacterial oriC Searcher), which identifies oriC sequences on gammaproteobacterial chromosomal fragments by employing motif-based machine learning methods. Using gammaBOriS, we created BOriS DB, which currently contains 25,827 gammaproteobacterial oriC sequences from 1,217 species, making it the largest available database of oriC sequences to date. Furthermore, we present gammaBOriTax, a machine-learning-based approach for taxonomic classification of oriC sequences, which was trained on the sequences in BOriS DB. Finally, we extracted the motifs relevant for the identification and classification decisions of the models. Our results suggest that machine learning sequence classification approaches can offer great support in functional motif identification.

Highlights

  • Before every cell division, bacteria need to duplicate their genetic material to ensure that this information can faithfully be passed on to both daughter cells

  • Using publicly available gammaproteobacterial chromosomal fragments as input for gammaBOriS, we gathered the largest dataset of bacterial oriC sequences available to date, BOriS DB

  • gammaBOriS is composed of three modules that were adjusted for and trained on a training set of gammaproteobacterial oriC sequences (Fig. 1)


Summary

Introduction

Bacteria need to duplicate their genetic material to ensure that this information can faithfully be passed on to both daughter cells. Some k-mer-SVMs use DNA models that allow mismatches or gaps while performing k-mer counting, taking into account the effect of natural variation[35,36,37]. Most of these machine learning models can produce a list of features important for the classification task, which is, in this case, a list of the most relevant motifs. We present a machine-learning-based approach for the study of bacterial oriC sequences in four parts, exemplified on Gammaproteobacteria. This class of organisms contains many model organisms (e.g., Escherichia coli, Vibrio cholerae, and Pseudomonas putida) and causative agents of serious illnesses (such as cholera, plague, and enteritis), which makes this taxon a highly relevant subject of study. We present a list of motifs that were important for the identification and classification and show that the machine learning models presented here were able to learn biologically relevant information from the DNA sequences presented to them.
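To make the idea of mismatch-tolerant k-mer counting concrete, the following is a minimal Python sketch. It is an illustration only and not the gammaBOriS implementation: the function names `hamming` and `mismatch_kmer_counts`, the toy sequence, and the choice of Hamming distance as the mismatch criterion are all assumptions made for this example. Each possible k-mer is credited with every window of the sequence that differs from it in at most `max_mismatches` positions, so a motif still contributes to its feature when natural variation changes a base.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_kmer_counts(seq, k, max_mismatches=1, alphabet="ACGT"):
    """Hypothetical helper: mismatch-tolerant k-mer feature counts.

    For every possible k-mer over `alphabet`, count the length-k windows
    of `seq` that lie within `max_mismatches` substitutions of it.
    Windows containing symbols outside the alphabet (e.g. N) are skipped.
    """
    seq = seq.upper()
    allowed = set(alphabet)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    windows = [w for w in windows if set(w) <= allowed]

    counts = Counter()
    # Brute force over all 4**k k-mers; fine for small k in a sketch.
    for kmer in map("".join, product(alphabet, repeat=k)):
        n = sum(1 for w in windows if hamming(kmer, w) <= max_mismatches)
        if n:
            counts[kmer] = n
    return counts


# Toy sequence, chosen for illustration only.
exact = mismatch_kmer_counts("ACGTACGT", k=3, max_mismatches=0)
fuzzy = mismatch_kmer_counts("ACGTACGT", k=3, max_mismatches=1)
print(exact["ACG"], exact["ACA"], fuzzy["ACA"])  # prints: 2 0 2
```

In the toy run, the k-mer ACA never occurs exactly, yet with one mismatch allowed it is credited with both occurrences of ACG. A count vector of this kind could then serve as input to a linear classifier, whose feature weights would give one way of ranking motifs by relevance. That last step is described here only in general terms and is not taken from the source.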

Methods
Results
Conclusion
