Abstract

The advent of next-generation sequencing technologies has allowed the relative quantification of microbiome communities and of their spatial and temporal variation. In recent years, supervised learning (i.e., prediction of a phenotype of interest) from taxonomic abundances has become increasingly common in the microbiome field. However, a gap exists between supervised analyses and classical unsupervised analyses, which are based on computing ecological dissimilarities for visualization or clustering. Despite this gap, both approaches face common challenges, such as the compositional nature of next-generation sequencing data and the integration of the spatial and temporal dimensions. Here we propose a kernel framework that places unsupervised and supervised microbiome analyses on common ground, including the retrieval of microbial signatures (taxa importances). We define two compositional kernels (Aitchison-RBF and compositional linear) and discuss how to transform non-compositional beta-dissimilarity measures into kernels. Spatial data is integrated with multiple kernel learning, while longitudinal data is evaluated by specific kernels. We illustrate our framework with a single-point soil dataset, a human dataset with a spatial component, and a previously unpublished longitudinal dataset concerning pig production. The proposed framework and the case studies are freely available in the kernInt package at https://github.com/elies-ramon/kernInt.
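As a rough illustration of the two compositional kernels named above, the following Python sketch computes a linear kernel and an RBF kernel on centered log-ratio (clr) transformed counts. It is not the kernInt implementation: the function names, the pseudocount used to handle zeros, and the `gamma` value are assumptions made for this example only.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x taxa count matrix.

    Assumption: zeros are handled by adding a pseudocount; other
    zero-replacement strategies exist and may be preferable.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def compositional_linear_kernel(counts, pseudocount=1.0):
    """Inner products between clr-transformed samples."""
    z = clr(counts, pseudocount)
    return z @ z.T

def aitchison_rbf_kernel(counts, gamma=0.1, pseudocount=1.0):
    """RBF kernel on the squared Aitchison distance (Euclidean in clr space)."""
    z = clr(counts, pseudocount)
    sq_norms = (z ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

# Toy data: 3 samples x 4 taxa (hypothetical counts).
counts = np.array([[10, 0, 5, 85],
                   [20, 3, 7, 70],
                   [1, 50, 40, 9]])

K_lin = compositional_linear_kernel(counts)
K_rbf = aitchison_rbf_kernel(counts)
```

Either matrix could then be passed to any method that accepts a precomputed kernel, for instance an SVM for prediction or kernel PCA for visualization, which is the sense in which a kernel gives supervised and unsupervised analyses a shared starting point.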

Highlights

  • The microbiome is defined as the ensemble of microorganisms and their genomes in a given environment

  • In reports that compare the performance of different supervised methods on microbiome data, support vector machines (SVM) often appear alongside random forests (RF) or artificial neural networks (ANN) (Qu et al., 2019; Zhou and Gallins, 2019; Namkung, 2020)

  • Kernel methods have mostly been used in isolation, without exploiting the kernel framework's ability to integrate a wide range of analyses while giving a unified view

Summary

Introduction

The microbiome is defined as the ensemble of microorganisms and their genomes in a given environment. Beta-diversity measures, e.g., Bray-Curtis or UniFrac, quantify the difference in diversity between samples from different habitats. They are used for clustering analysis or, more commonly, for visualization techniques like principal coordinates analysis (PCoA) or multidimensional scaling (MDS). This approach has been challenged, as the abundance data obtained by NGS has a particular nature. Library size is uninformative because it does not contain information about the population. Instead, it is arbitrarily fixed by the sequencing process and may vary by orders of magnitude across samples (McMurdie and Holmes, 2014). One example of such a challenge is the proposal to use the compositional Aitchison distance instead of the classic beta-diversity measures (Quinn et al., 2018).
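The point about library size can be made concrete with a small Python sketch. It compares the Aitchison distance with Bray-Curtis computed on raw counts when one sample is sequenced 100 times deeper but keeps the same composition. The toy counts and function names are assumptions for illustration, counts are taken to be strictly positive so that no zero replacement is needed, and in practice Bray-Curtis is often applied after normalization or rarefaction rather than to raw counts.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of one strictly positive count vector."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

def aitchison_distance(x, y):
    """Euclidean distance between clr-transformed samples."""
    return float(np.linalg.norm(clr(x) - clr(y)))

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on the values as given (here: raw counts)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.abs(x - y).sum() / (x + y).sum())

a = np.array([10.0, 20.0, 70.0])
b = np.array([30.0, 30.0, 40.0])
b_deep = 100.0 * b  # same composition as b, 100x larger library size

d_ait = aitchison_distance(a, b)
d_ait_deep = aitchison_distance(a, b_deep)  # unchanged: clr removes the scale
d_bc = bray_curtis(a, b)
d_bc_deep = bray_curtis(a, b_deep)          # inflated by sequencing depth alone

print(f"Aitchison:   {d_ait:.4f} vs {d_ait_deep:.4f}")
print(f"Bray-Curtis: {d_bc:.4f} vs {d_bc_deep:.4f}")
```

Because the clr transform subtracts the mean log-abundance of each sample, multiplying a sample by a constant leaves its clr coordinates, and hence the Aitchison distance, unchanged.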

