A comparison of machine learning and Bayesian modelling for molecular serotyping

Richard Newton, Lorenz Wernisch

doi:10.1186/s12864-017-3998-6

Abstract

Background

Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only a few samples available, a model-driven approach was the only option. In the meantime, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model.

Results

We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes, due to the large number of possible combinations. Most of the available training data comprise samples with only a single serotype. To overcome the lack of training data we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single-serotype arrays.

Conclusions

With the enhanced training set the machine learning algorithms outperform the original Bayesian model. However, for serotypes currently lacking sufficient training data, the best-performing implementation was a combination of the results of the Bayesian model and the Gradient Boosting Machine. As well as being an effective method for classifying biological data, machine learning can also be used as an efficient method for revealing subtle biological insights, which we illustrate with an example.
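The idea of building artificial mixture data from single-serotype arrays can be illustrated with a minimal sketch. The abstract does not state how the raw data were combined, so the linear weighting used here is purely an assumption, and the function names, serotype labels and intensity values are hypothetical. It is not the authors' iterative procedure.

```python
import random

def make_mixture(profile_a, profile_b, fraction_a):
    """Combine two single-serotype probe profiles into one artificial mixture.

    Assumption (not from the paper): raw intensities add linearly in
    proportion to the relative abundance of each serotype.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same probes")
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must lie in [0, 1]")
    return [fraction_a * a + (1.0 - fraction_a) * b
            for a, b in zip(profile_a, profile_b)]

def build_mixture_set(singles, n_samples, seed=0):
    """Draw random serotype pairs and mixing fractions.

    singles: dict mapping a serotype label to a list of raw probe intensities.
    Returns a list of (intensities, frozenset of labels) training pairs.
    """
    rng = random.Random(seed)
    labels = sorted(singles)
    samples = []
    for _ in range(n_samples):
        first, second = rng.sample(labels, 2)   # two distinct serotypes
        fraction = rng.uniform(0.1, 0.9)        # hypothetical abundance range
        mixed = make_mixture(singles[first], singles[second], fraction)
        samples.append((mixed, frozenset((first, second))))
    return samples

# Toy data: made-up intensities for three hypothetical serotypes.
singles = {
    "A": [900.0, 50.0, 40.0],
    "B": [60.0, 850.0, 45.0],
    "C": [55.0, 40.0, 800.0],
}
training = build_mixture_set(singles, n_samples=5)
```

Each generated sample carries a set of labels rather than a single label, so it could be passed to any multi-label classifier alongside the real single-serotype arrays.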

Highlights

Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality
We investigate different approaches to analysing the raw data from a custom genomic microarray that has been designed for molecular serotyping of Streptococcus pneumoniae [1, 2]
For the D.36 dataset the Gradient Boosting Machine (GBM) performs much better than the Bayesian model for samples containing mixtures of serotypes, with the Random Forest performance intermediate between the two

Summary

Introduction

Streptococcus pneumoniae is an important human pathogen and a major cause of infant mortality. We investigate different approaches to analysing the raw data from a custom genomic microarray that has been designed for molecular serotyping of Streptococcus pneumoniae [1, 2]. Genomic microarrays are a method for detecting the presence or absence of multiple genes within a sample simultaneously, through specific binding to an array of high-density probes. A microarray constructed with probes for genes specific to different strains of an organism can detect the presence of a particular strain of the organism in a clinical sample according to which of the probes have an elevated signal. Several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement our earlier empirical Bayesian model.
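The principle of calling a strain present when its specific probes show an elevated signal can be shown with a naive thresholding sketch. This is neither the Bayesian model nor either of the machine learning classifiers from the paper. The fold-over-background rule, the probe identifiers, the labels and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import median

def call_serotypes(intensities, probe_sets, background, fold=3.0):
    """Return the labels whose median probe signal exceeds fold * background.

    intensities: dict mapping probe id to raw signal.
    probe_sets:  dict mapping a label to the probe ids specific to it.
    The threshold rule is an assumption made for this example.
    """
    threshold = fold * background
    calls = []
    for label, probe_ids in probe_sets.items():
        signal = median(intensities[p] for p in probe_ids)
        if signal > threshold:
            calls.append(label)
    return sorted(calls)

# Toy array: probes for labels A and C are elevated, those for B are not.
intensities = {"p1": 920.0, "p2": 880.0, "p3": 60.0,
               "p4": 45.0, "p5": 700.0, "p6": 650.0}
probe_sets = {"A": ["p1", "p2"], "B": ["p3", "p4"], "C": ["p5", "p6"]}
calls = call_serotypes(intensities, probe_sets, background=50.0)
```

A fixed threshold of this kind ignores probe-to-probe variation and cross-hybridisation, which is part of why model-based or learned classifiers are used in practice.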

Journal: BMC Genomics
Publication Date: Aug 11, 2017
Citations: 3
License type: open-access