Abstract

Background: Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments and is widely used to survey the communities of microbial organisms that live in diverse ecosystems. To understand the metagenomic profile of the public transit system, one of the densest interaction spaces for millions of people, the MetaSUB International Consortium has collected and sequenced metagenomes from subways in different cities across the world. In collaboration with CAMDA, MetaSUB has made the metagenomic samples from these cities available for an open data analysis challenge that includes, but is not limited to, the identification of unknown samples.

Results: To distinguish the metagenomic profiles of different cities and to predict the origin of unknown samples precisely from those profiles, we propose two approaches based on machine learning techniques: a read-based taxonomic profiling and prediction method, and a reduced representation assembly-based method. Among the machine learning techniques tested, random forest showed promising results as a suitable classifier for both approaches. Random forest models developed from read-based taxonomic profiling achieved an accuracy of 91% with a 95% confidence interval between 80 and 93%. The assembly-based random forest model also reached 90% accuracy. However, both models achieved roughly the same accuracy on the testing set, where both failed to predict the most abundant label.

Conclusion: Our results suggest that both read-based and assembly-based approaches are powerful tools for the analysis of metagenomics data. Moreover, our results suggest that reduced representation assembly-based methods are able to simultaneously provide high-accuracy prediction on available data. Overall, we show that metagenomic samples can be traced back to their location by carefully generating features from the composition of microbes and applying existing machine learning algorithms. The proposed approaches show high prediction accuracy, but their output requires careful inspection before any decision is made, because of sample noise or complexity.

Reviewers: This article was reviewed by Eugene V. Koonin, Jing Zhou and Serghei Mangul.
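
For readers who want to relate a reported accuracy to its 95% confidence interval, the following is a minimal sketch using a Wilson score interval for a binomial proportion. This is an assumption for illustration: the abstract does not state how the interval was computed, and the counts used here (91 correct out of 100) are hypothetical and not taken from the study. The function name `wilson_interval` is likewise only illustrative.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z ** 2 / total
    centre = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical counts for illustration only; not the study's data.
low, high = wilson_interval(correct=91, total=100)
print(f"accuracy = {91 / 100:.2f}, approx. 95% CI = ({low:.3f}, {high:.3f})")
```

The interval narrows as the number of test samples grows, which is one reason an accuracy estimated on a small test set should be read together with its interval.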

Highlights

  • Read-based machine learning prediction: to shorten the turnaround time of running MetaPhlAn2 on the 223 primary data sets from eight cities, we used both the multi-threaded option provided in MetaPhlAn2 and a multi-job submission script to run the MetaPhlAn2 jobs in parallel on our many-node cluster

  • We investigated linear discriminant analysis (LDA) and random forest (RF) machine learning techniques
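
As a rough illustration of how LDA and RF classifiers can be compared on taxonomic profiles, the sketch below trains both on a synthetic samples-by-taxa relative-abundance table. Everything here is an assumption for illustration: the data are randomly generated rather than MetaPhlAn2 output, the city labels are hypothetical integers, scikit-learn is a swapped-in library (the source does not say which implementation the authors used), and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a samples x taxa relative-abundance table.
n_cities, samples_per_city, n_taxa = 4, 30, 20
X_parts, y_parts = [], []
for city in range(n_cities):
    # Each hypothetical city gets its own Dirichlet concentration vector,
    # so its samples share a characteristic mix of taxa.
    alpha = rng.uniform(0.5, 5.0, size=n_taxa)
    X_parts.append(rng.dirichlet(alpha, size=samples_per_city))
    y_parts.append(np.full(samples_per_city, city))
X = np.vstack(X_parts)          # each row sums to 1 (relative abundances)
y = np.concatenate(y_parts)     # integer city label per sample

models = {
    # Relative abundances are collinear (rows sum to 1), so a shrinkage
    # covariance estimate is used to keep LDA well conditioned.
    "LDA": LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over stratified 5-fold cross-validation.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold accuracy = {scores[name]:.2f}")
```

With real data, `X` would be replaced by the per-sample abundance profiles and `y` by the known city of origin; cross-validated accuracy on synthetic data says nothing about the accuracy reported in the article.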

Introduction

Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments and is widely used to survey the communities of microbial organisms that live in diverse ecosystems. Metagenomic profiling has been explored in relation to the microbial impact on human health and disease, both through direct analysis of human-derived samples and through samples of the human-occupied environment. By 2012, after generating over 5000 samples and 3.5 terabasepairs (Tbp) of next-generation sequencing (NGS) data, the Human Microbiome Project (HMP) had identified trends in the structure of the human microbiome and an incredible amount of diversity [4, 5]. This diversity stems from the varied backgrounds of human samples with respect to phenotype, lifestyle, and country of origin [6,7,8]. Changes in the human microbiome have been associated with Clostridioides difficile infection [9,10,11], bacterial vaginosis [12,13,14,15], Parkinson’s disease [16], and potentially even commonplace challenges with mental health [17, 18].

Harris et al. Biology Direct (2019) 14:12
