Identification of city specific important bacterial signature for the MetaSUB CAMDA challenge microbiome data

Alejandro R Walker, Susmita Datta

doi:10.1186/s13062-019-0243-z

Abstract

Background: Metagenomic whole genome sequencing (WGS) data from samples collected across several cities around the globe may reveal city-specific microbial signatures. Illumina MiSeq sequencing data from 12 cities in 7 countries were provided as part of the 2018 CAMDA “MetaSUB Forensic Challenge”, together with samples from three mystery sets. We applied machine learning techniques to this large dataset to identify the geographical provenance of the “mystery” samples. Additionally, we pursued compositional data analysis to develop accurate inferential techniques for such microbiome data. These data are of higher quality and greater sequencing depth than the CAMDA 2017 MetaSUB challenge data and, together with improved analytical techniques, are expected to yield more interesting, robust, and useful results that can benefit forensic analysis.

Results: A preliminary quality screening revealed a much better dataset in terms of Phred quality score (hereafter Phred score), with larger paired-end MiSeq reads and a more balanced experimental design, although the number of samples still differed across cities. Principal component analysis (PCA) showed interesting clusters of samples, and the first three components explained a large share of the variability in the data (~70%). The classification analysis was consistent across both testing mystery sets, with a similar percentage of samples correctly predicted (up to 90%). The analysis of the relative abundance of bacterial “species” showed that some “species” are specific to some regions and can play important roles in prediction. These results were corroborated by the variable importance assigned to the “species” during the internal cross-validation (CV) run with Random Forest (RF).

Conclusions: The unsupervised analysis (PCA and two-way heatmaps) of the log2-cpm normalized data and the differential analysis of relative abundance suggested that the bacterial signature of common “species” was distinctive across cities, which was also supported by the variable importance results. The prediction of the city for mystery sets 1 and 3 showed convincing results, with high classification accuracy and consistency. The focus of this work on the current MetaSUB data, and the analytical tools used here, can be of great help in forensics, metagenomics, and other sciences for predicting the city of provenance of metagenomic samples, as well as in related fields. Additionally, the pairwise analysis of relative abundance yielded “species” that were consistent and comparable with those ranked as important variables by the classifier.

Reviewers: This article was reviewed by Manuela Oliveira, Dimitar Vassilev, and Patrick Lee.
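The log2-cpm normalization and relative abundance mentioned in the abstract can be illustrated with a minimal sketch. The text here does not give the exact formula used, so this example assumes the offset form used by limma's voom, log2((count + 0.5) / (library size + 1) × 10^6). The function names, the offsets as parameters, and the toy taxon counts are all hypothetical.

```python
import math


def log2_cpm(counts, count_offset=0.5, lib_offset=1.0):
    """Convert one sample's raw taxon counts to log2 counts-per-million.

    Assumed formula (voom-style offsets, not confirmed by the abstract):
        log2((count + 0.5) / (library_size + 1) * 1e6)
    The offsets keep zero counts finite after the log.
    """
    library_size = sum(counts.values())
    return {
        taxon: math.log2((c + count_offset) / (library_size + lib_offset) * 1e6)
        for taxon, c in counts.items()
    }


def relative_abundance(counts):
    """Express each taxon as a fraction of the sample's total count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads assigned to any taxon")
    return {taxon: c / total for taxon, c in counts.items()}


# Hypothetical toy sample: raw read counts per taxon.
sample = {"taxon_A": 120, "taxon_B": 30, "taxon_C": 0}
print(log2_cpm(sample))
print(relative_abundance(sample))
```

Normalizing by library size before taking logs makes samples of different sequencing depth comparable, which matters when the normalized values feed into PCA or a heatmap.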

Highlights

Metagenomic whole genome sequencing (WGS) data from samples collected across several cities around the globe may reveal city-specific microbial signatures
The current version of the CAMDA MetaSUB challenge data was of much better quality, although the number of samples for each of the 12 cities was small overall
We expanded our methodology to classify the mystery samples and used improved statistical inference, based on compositional data analysis of microbiome data, to identify important city-specific differentiating microbes
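As an illustration of what compositional data analysis involves, the sketch below applies the centered log-ratio (CLR) transform to one sample. The text here does not say which compositional method was used, so CLR is shown only as a common building block, not as the authors' procedure. The pseudocount value and the toy counts are assumptions.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    Each value becomes log(count) minus the mean of the logs, i.e. the log
    of the count divided by the sample's geometric mean. The pseudocount
    (an assumption here) keeps zero counts finite.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical toy sample: raw read counts for four taxa.
sample = [120, 30, 0, 50]
print([round(v, 3) for v in clr(sample)])
```

Because every value is taken relative to the sample's own geometric mean, multiplying all counts by a constant (for example, sequencing the same sample twice as deeply) leaves the result unchanged when no pseudocount is added. That scale invariance is the reason compositional methods are used for relative abundance data.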

Summary

Introduction

Metagenomic whole genome sequencing (WGS) data from samples collected across several cities around the globe may reveal city-specific microbial signatures. The current data are of higher quality and greater sequencing depth than the CAMDA 2017 MetaSUB challenge data and, together with improved analytical techniques, are expected to yield more interesting, robust, and useful results that can benefit forensic analysis. We expanded our methodology to classify the mystery samples and used improved statistical inference, based on compositional data analysis of microbiome data, to identify important city-specific differentiating microbes. In this context, three more datasets were provided as mystery datasets through the CAMDA 2018 MetaSUB challenge to serve as testing samples for the classification problem. The latter issue can be improved by adding more samples [1] to the dataset or by increasing the sequencing depth [2].



Journal: Biology Direct
Publication Date: Jul 24, 2019
Citations: 24
License type: open-access
