Abstract

Background: The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated the influence of variable technical, analytical and machine learning approaches on result interpretation and novel source prediction.

Results: Comparison between 16S rRNA amplicon and shotgun sequencing approaches, as well as between metagenomic analytical tools, showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data analyzed using Kraken2 and Bracken for taxonomic annotation had higher detection sensitivity. Because classification models are limited to labeling pre-trained origins, we took an alternative approach using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, the prediction errors were much higher in Leave-1-city-out than in 10-fold cross-validation; the former more realistically reflects the increased difficulty of accurately predicting samples from new origins. This challenge was further confirmed when applying the model to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Because prediction error rates are higher for samples from new origins, we provide an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.

Conclusions: Herein, we highlight the capacity to predict sample origin accurately for pre-trained origins and the challenge of predicting new origins with both regression and classification models. Overall, this work provides a summary of the impact of sequencing technique, protocol, taxonomic analytical approaches, and machine learning approaches on the use of metagenomics for prediction of sample origin.
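The contrast between the two validation schemes in the abstract comes down to how samples are split. The sketch below is illustrative only and is not the authors' code: the function names `leave_one_city_out` and `k_fold` and the toy city labels are assumptions made for this example. It shows that holding out one city at a time guarantees the test city is never seen in training, whereas an ordinary k-fold split usually leaves samples from the test cities in the training set.

```python
from collections import defaultdict


def leave_one_city_out(cities):
    """Yield (held_out_city, train_idx, test_idx), holding out one city per split."""
    by_city = defaultdict(list)
    for i, city in enumerate(cities):
        by_city[city].append(i)
    for city in sorted(by_city):
        test_idx = by_city[city]
        # Training indices exclude every sample from the held-out city.
        train_idx = [i for i, c in enumerate(cities) if c != city]
        yield city, train_idx, test_idx


def k_fold(n_samples, k):
    """Yield (train_idx, test_idx) for k interleaved folds (sample i goes to fold i % k)."""
    for fold in range(k):
        test_idx = list(range(fold, n_samples, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


# Hypothetical city labels, one per sample (not data from the study).
cities = ["A", "A", "B", "B", "C", "C"]

for city, train_idx, test_idx in leave_one_city_out(cities):
    seen = {cities[i] for i in train_idx}
    print(f"held out {city}: seen in training = {city in seen}")

for train_idx, test_idx in k_fold(len(cities), 3):
    seen = {cities[i] for i in train_idx}
    overlap = sorted({cities[i] for i in test_idx} & seen)
    print(f"k-fold test cities also in training: {overlap}")
```

In a setup like this, any model evaluated on the first kind of split faces an origin it has never seen, which is consistent with the higher errors the abstract reports for that scheme.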

Highlights

  • The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction

  • (SG-KB) as well as the table provided by Critical Assessment of Massive Data Analysis (CAMDA) using

  • Using the Metadesign of Subways & Urban Biomes (MetaSUB) data, we demonstrated that heterogeneous experimental protocols used for sample collection and sequencing between cities can have a substantial influence on the prediction

Summary

Introduction

The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Microbiome studies have demonstrated successes in detecting microbial compositional patterns in health and environmental contexts. Two sequencing approaches are commonly used: 16S ribosomal RNA (rRNA) amplicon sequencing, which targets and sequences a region within the 16S rRNA gene of bacteria and archaea, and shotgun whole genome sequencing, in which all genetic material present in a sample is sequenced. The latter has the potential to identify species down to the strain level and to detect and characterize functional units such as genes, plasmids, or pathogenicity islands. Although each technique has pros and cons, both have been used successfully to extract meaningful biological information in disease and environmental studies [2, 3, 8, 9, 10, 11].
