Analyzing Large Microbiome Datasets Using Machine Learning and Big Data

Thomas Krause, Jyotsna Talreja Wassan, Haiying Wang, Paul Mc Kevitt, Huiru Zheng, Matthias Hemmje

doi:10.3390/biomedinformatics1030010

Open Access

https://doi.org/10.3390/biomedinformatics1030010

Abstract

Metagenomics promises valuable new insights into the role of microbiomes in eukaryotic hosts such as humans. As sequencing costs fall, public and private repositories for human metagenomic datasets are growing fast. Metagenomic datasets can contain terabytes of raw data, which is a challenge for data processing but also an opportunity for advanced machine learning methods such as deep learning that require large datasets. However, in contrast to classical machine learning algorithms, the use of deep learning in metagenomics is still the exception. Regardless of the algorithms used, they are usually not applied to raw data but to data that has passed through several preprocessing steps. Performing this preprocessing and the actual analysis in an automated, reproducible, and scalable way is another challenge. This and other challenges can be addressed by adapting known big data methods and architectures to the needs of microbiome analysis and DNA sequence processing. A conceptual architecture for the use of machine learning and big data on metagenomic datasets was recently presented and initially validated by analyzing the rumen microbiome. As discussed in this paper, the same architecture can be used for clinical purposes.

Highlights

Current studies are showing the importance and contribution of communities of microorganisms, known as the microbiota, for human development [1], diet–microbiota interactions [2], interactions with the immune system [3,4], and diseases [5,6].
As this paper focuses on demonstrating classification improvements for Machine Learning (ML)
Clustering algorithms, on the other hand, try to find and group similar data points without using predetermined classes. Another unsupervised type of ML often used in metagenomics, distinct from both classification and clustering, is dimensionality reduction [13].
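To make the idea of grouping without predetermined classes concrete, the following is a minimal sketch of k-means clustering in plain Python. It is an illustration only and is not taken from the paper: the sample values are hypothetical relative abundances of three taxa, the function name `kmeans` is an assumption, and the deterministic choice of initial centroids is a simplification that real tools replace with randomized initialization.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means: returns one cluster index per point."""
    # Simplification for this sketch: the first k points seed the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical relative abundances of three taxa in six samples.
samples = [
    (0.70, 0.20, 0.10),
    (0.10, 0.15, 0.75),
    (0.65, 0.25, 0.10),
    (0.15, 0.10, 0.75),
    (0.72, 0.18, 0.10),
    (0.12, 0.18, 0.70),
]
labels = kmeans(samples, k=2)
print(labels)
```

No class labels are passed in; the two groups (samples dominated by the first taxon and samples dominated by the third) emerge from the distances alone, and the script prints `[0, 1, 0, 1, 0, 1]`.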

Summary

Introduction

Current studies are showing the importance and contribution of communities of microorganisms, known as the microbiota, for human development [1], diet–microbiota interactions [2], interactions with the immune system [3,4], and diseases [5,6]. The traditional way to answer these and other research questions would be to take samples of the microorganisms from their environment and culture them in a lab. Afterward, they could be studied and compared to other samples to detect similarities or differences in the composition of microorganisms between samples. A single microbiome study can contain hundreds of gigabytes or more of raw sequencing data. During processing, this volume can multiply many times over, as intermediate results in different formats need to be produced. A concern arises when deep learning models are applied to metagenomic classification of phenotypes (linking metagenomic data to observable characteristics of the microorganisms or hosts): there are often more features than samples, as is common in predictive modeling of metagenomes.
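The concern about having more features than samples can be illustrated with a small experiment. The sketch below is not from the paper and makes several assumptions: it uses a simple perceptron as a stand-in for a more complex model, the features are random noise, and the labels are assigned at random, so there is nothing real to learn. With 10 samples and 50 features the model can still separate the training data perfectly, while its accuracy on fresh data is expected to stay near chance.

```python
import random

def train_perceptron(X, y, epochs=200):
    """Classic perceptron; stops once every training sample is classified correctly."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                # Misclassified: nudge the weights toward this sample's label.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, X, y):
    """Fraction of samples whose predicted sign matches the label."""
    hits = sum(1 for xi, yi in zip(X, y)
               if yi * sum(wj * xj for wj, xj in zip(w, xi)) > 0)
    return hits / len(X)

rng = random.Random(0)
n_samples, n_features = 10, 50  # far more features than samples

def random_dataset(n):
    # Pure noise: features and labels are independent of each other.
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]
    return X, y

X_train, y_train = random_dataset(n_samples)
X_test, y_test = random_dataset(200)

w = train_perceptron(X_train, y_train)
train_acc = accuracy(w, X_train, y_train)
test_acc = accuracy(w, X_test, y_test)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A perfect training score on meaningless labels is the signature of overfitting, which is why studies with few samples typically rely on feature selection, dimensionality reduction, or regularization before or during model fitting.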

Structure of Metagenomic Studies

Five Phases of Metagenomic Studies

Example

Machine Learning

Vector Space Transformations

Support Vector Machines

Decision Trees

Random Forest

Naïve Bayes Classifier

Logistic Regression

Clustering Algorithms

Neural Networks

Deep Learning

Role of Machine Learning in Metagenomics

Obtaining Raw Sequence Data

Preprocessing

OTU Clustering

Read Binning

Read Assembly

Taxonomic Annotation

Functional Annotation

Gene Prediction

Phenotype Classification

Other Common Analysis Tasks

Interaction and Perception

Model Selection

Deep Learning and Feature Engineering

Accessibility

Explainability

Reproducibility

Biological Diversity

High Dimensionality and Low Number of Samples

Big Data

Metagenomic Processing Pipelines

Galaxy

MG-RAST and MGnify

QIIME 2

MetaPlat and Successors

AI2VIS4BigData Conceptual Architecture for Metagenomics Supporting

Description of the Conceptual Architecture

Use in Clinical Settings

Journal: BioMedInformatics	Publication Date: Nov 8, 2021
Citations: 11	License type: CC BY 4.0
