Critical comparison of methods for fault diagnosis in metabolomics data

M. Koeman, L. Buydens, J. Engel, J. Jansen

doi:10.1038/s41598-018-37494-7

Open Access

https://doi.org/10.1038/s41598-018-37494-7

Journal: Scientific Reports	Publication Date: Feb 4, 2019
Citations: 7	License type: open-access

Affiliation: Radboud University Nijmegen

Abstract

Platforms like metabolomics provide an unprecedented view of the chemical versatility in biomedical samples. Many diseases manifest as perturbations in specific metabolite combinations. Multivariate analyses are essential to detect such combinations and associate them with specific diseases. This is usually done by targeted discrimination of samples associated with a specific disease from non-diseased control samples. Such targeted data interpretation may not respect the heterogeneity of metabolic responses, both between and within diseases. Here we show that multivariate methods that find any set of perturbed metabolites in a single patient may be employed in combination with data collected with a single metabolomics technology to simultaneously investigate a large array of diseases. Several such untargeted data analysis approaches have already been proposed in other fields, e.g. in Statistical Process Control, to find both expected and unexpected perturbations. We have critically compared several of these approaches for their sensitivity and their correct identification of the specifically perturbed metabolites, and we also introduce a new approach for this purpose. The newly introduced Sparse Mean approach, which we find to be the most sensitive and best able to identify the specifically perturbed metabolites, turns metabolomics into an untargeted diagnostic platform. Aside from metabolomics, the proposed approach may greatly benefit fault diagnosis with untargeted analyses in many other fields, such as Industrial Process Control, Food Adulteration Detection, and Intrusion Detection.

Highlights

MSPC (Multivariate Statistical Process Control) type approaches[1,2] use a multivariate model to describe data from observations that have been defined as ‘normal’ or control
Classical MSPC[1] is performed by applying Principal Component Analysis (PCA) to data from a set of control samples
Potentially abnormal samples can be tested by projecting them onto this PCA model; their position in either subspace can then be tested for compliance with that of the control samples using a suitable statistical test
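
The workflow in these highlights (fit PCA on control samples, project a new sample, test it in the model and residual subspaces) can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation. The function names `fit_mspc` and `score_sample`, the synthetic data, the choice of two components, the use of Hotelling's T² and the Q statistic as the two subspace statistics, and the empirical 99th-percentile control limits (in place of a parametric test) are all assumptions made for the example.

```python
import numpy as np

def score_sample(model, x):
    """Return (T2, Q) for one sample: its distance inside and outside the PCA model."""
    z = (x - model["mean"]) / model["std"]           # autoscale with control statistics
    t = z @ model["P"]                               # scores in the model subspace
    t2 = float(np.sum(t ** 2 / model["score_var"]))  # Hotelling-type T2
    residual = z - t @ model["P"].T                  # part not described by the model
    q = float(residual @ residual)                   # Q: squared residual norm
    return t2, q

def fit_mspc(X_control, n_components=2, quantile=0.99):
    """Fit PCA on control samples (rows) and derive empirical control limits."""
    mean = X_control.mean(axis=0)
    std = X_control.std(axis=0, ddof=1)
    Z = (X_control - mean) / std
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                          # loadings: variables x components
    score_var = s[:n_components] ** 2 / (Z.shape[0] - 1)
    model = {"mean": mean, "std": std, "P": P, "score_var": score_var}
    stats = np.array([score_sample(model, x) for x in X_control])
    # Assumption: limits taken as empirical quantiles of the control statistics.
    model["t2_limit"], model["q_limit"] = np.quantile(stats, quantile, axis=0)
    return model

# Synthetic controls (assumption): two latent factors, each driving five variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X_control = np.repeat(latent, 5, axis=1) + 0.1 * rng.normal(size=(200, 10))

model = fit_mspc(X_control)

# A control-like sample in which a single variable is shifted by 5 standard deviations.
x_new = X_control[0].copy()
x_new[3] += 5 * model["std"][3]

t2_new, q_new = score_sample(model, x_new)
is_outlier = t2_new > model["t2_limit"] or q_new > model["q_limit"]
print(f"T2={t2_new:.2f}, Q={q_new:.2f}, outlier={is_outlier}")
```

In this toy setup the shifted variable breaks the correlation structure learned from the controls, so the deviation shows up mainly in the residual subspace. Real applications would choose the number of components and the control limits with more care.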

Summary

Introduction

MSPC (Multivariate Statistical Process Control) type approaches[1,2] use a multivariate model to describe data from observations that have been defined as ‘normal’ or control. Applications include Industrial Process Control[3], Adulteration Detection[4], Intrusion Detection[5] and Health Monitoring[6,7]. These are all situations where the training “group” of abnormal samples contains no, or relatively few, observations compared to the large number of control observations, a scenario in which two-class or multi-class classification performs poorly. One-class modeling allows for detection of heterogeneous abnormalities, where a phenomenon may cause different reactions in different observations. This is especially beneficial in the aforementioned Health Monitoring, which applies MSPC to metabolomics data to identify molecular perturbations due to possibly heterogeneous diseases. Samples that deviate significantly from the known control samples (i.e. outliers) are further investigated using fault diagnosis approaches that quantify the deviation observed in either of the two subspaces and decompose it into variable-specific contributions. These approaches are subject to the ‘smearing effect’[11,12], which may diagnose non-abnormal variables as false positives because of the geometry of the partition of variability in the dimension reduction. This may greatly limit how well the contribution plots can be interpreted in terms of a root cause.
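
To make the decomposition into variable-specific contributions and the smearing effect more concrete, here is a small, hedged sketch in Python with NumPy. It uses one common definition of residual (Q) contributions, the squared residual per variable, which is an assumption and not necessarily the definition used in the paper. The synthetic data, the two-component PCA model, and the helper name `q_contributions` are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic controls (assumption): two latent factors, each driving a block of
# five correlated variables, plus small measurement noise.
latent = rng.normal(size=(200, 2))
X_control = np.repeat(latent, 5, axis=1) + 0.1 * rng.normal(size=(200, 10))

# Autoscale and fit a two-component PCA model via SVD.
mean = X_control.mean(axis=0)
std = X_control.std(axis=0, ddof=1)
Z = (X_control - mean) / std
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T                                  # loadings: variables x components

def q_contributions(x):
    """Squared residual per variable; the values sum to the Q statistic."""
    z = (x - mean) / std
    residual = z - (z @ P) @ P.T              # part of z outside the model subspace
    return residual ** 2

# Perturb only variable 3 of an otherwise control-like sample.
x_faulty = X_control[0].copy()
x_faulty[3] += 5 * std[3]

contrib = q_contributions(x_faulty)
print(np.round(contrib, 2))
```

Only variable 3 is perturbed here, yet the other variables in its correlated block (0, 1, 2 and 4) also receive clearly non-zero contributions, while the unrelated block (5 to 9) stays near zero. Projecting the fault onto the model subspace spreads part of it over correlated variables, which is the kind of smearing the text describes.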

Methods

Results

Discussion

Conclusion