Abstract
Patients with inflammatory bowel disease (IBD) wait months and undergo numerous invasive procedures between the initial appearance of symptoms and receiving a diagnosis. In order to reduce time until diagnosis and improve patient wellbeing, machine learning algorithms capable of diagnosing IBD from the gut microbiome’s composition are currently being explored. To date, these models have had limited clinical application due to decreased performance when applied to a new cohort of patient samples. Various methods have been developed to analyze microbiome data which may improve the generalizability of machine learning IBD diagnostic tests. With an abundance of methods, there is a need to benchmark the performance and generalizability of various machine learning pipelines (from data processing to training a machine learning model) for microbiome-based IBD diagnostic tools. We collected fifteen 16S rRNA microbiome datasets (7,707 samples) from North America to benchmark combinations of gut microbiome features, data normalization and transformation methods, batch effect correction methods, and machine learning models. Pipeline generalizability to new cohorts of patients was evaluated with two binary classification metrics following leave-one-dataset-out cross (LODO) validation, where all samples from one study were left out of the training set and tested upon. We demonstrate that taxonomic features processed with a compositional transformation method and batch effect correction with the naive zero-centering method attain the best classification performance. In addition, machine learning models that identify non-linear decision boundaries between labels are more generalizable than those that are linearly constrained. Lastly, we illustrate the importance of generating a curated training dataset to ensure similar performance across patient demographics. These findings will help improve the generalizability of machine learning models as we move towards non-invasive diagnostic and disease management tools for patients with IBD.
Highlights
The human gut microbiome is a collection of microbes, viruses and fungi residing throughout the digestive tract
Our benchmark provides practical suggestions for ways to improve the performance of an inflammatory bowel disease (IBD) diagnostic test using the gut microbiome composition
Genus abundance estimates from 16S rRNA sequencing need to be normalized by a compositional transformation method, with centered log ratio (CLR) transformation being the most appropriate as it allows for each feature’s importance to the machine learning (ML) models decision to be assessed
Summary
The human gut microbiome is a collection of microbes, viruses and fungi residing throughout the digestive tract. Alterations in the gut microbiome have been linked to illnesses such as multiple sclerosis, type II diabetes, and inflammatory bowel disease (IBD) (Gevers et al, 2014; Opazo et al, 2018). The prevalence of IBD is increasing globally over the last several decades, from 79.5 to 84.3 per 100,000 people between 1990 and 2017, with Canada having among the highest IBD rates at 700 per 100,000 people in 2018 (Benchimol et al, 2019; GBD 2017 Inflammatory Bowel Disease Collaborators, 2020). The disease etiology is currently undetermined, the increasing rates of IBD have been linked to lifestyle factors, such as a Western diet (Rizzello et al, 2019)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.