Abstract

High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for greater standardization of, and advances in, how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations.

Highlights

  • Rapid advances in mass spectrometry (MS) technologies have enabled the generation of large-scale metabolomics data in human studies

  • Investigations using metabolomics technologies have applied a variety of statistical methods in analyses of datasets containing up to 200 metabolite measures, typically acquired from a targeted metabolomics platform collected from human studies involving tens to hundreds of observations [4,5]

  • The P value from each separate metabolite test can be judged significant or non-significant against a threshold that is corrected to account for the testing of multiple hypotheses
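The correction described in the last highlight can be sketched as follows. This is a minimal illustration using the Benjamini-Hochberg false-discovery-rate procedure, one common choice (Bonferroni correction, which divides the significance threshold by the number of tests, is a more conservative alternative); the per-metabolite P values shown are hypothetical, for illustration only.

```python
# Minimal sketch of multiple-testing correction for per-metabolite P values,
# using the Benjamini-Hochberg false-discovery-rate (FDR) procedure.

def benjamini_hochberg(pvals, alpha=0.05):
    """Return one boolean per P value: True where the test is declared
    significant while controlling the FDR at level `alpha`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending P
    # Find the largest rank k with P_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            max_k = rank
    # Declare significant every test ranked at or below max_k.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject

# Eight hypothetical P values from separate per-metabolite tests:
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(pvals))
# -> [True, True, False, False, False, False, False, False]
```

Note that under an uncorrected 0.05 threshold five of these eight tests would be called significant, whereas the FDR-controlled procedure retains only two, illustrating why correction matters when hundreds of metabolites are tested.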

Introduction

Rapid advances in mass spectrometry (MS) technologies have enabled the generation of large-scale metabolomics data in human studies. These technical advances have outpaced the development of statistical methods for handling and analyzing datasets of burgeoning size and complexity [1,2,3]. Investigations using metabolomics technologies have applied a variety of statistical methods in analyses of datasets containing up to 200 metabolite measures, typically acquired from a targeted metabolomics platform in human studies involving tens to hundreds of observations [4,5]. Current metabolomics technologies offer increased throughput that can facilitate data collection for thousands of observations per human cohort experiment [7,8]. There are, however, no existing standard protocols for analyzing these increasingly complex metabolomics data.

