Simplivariate Models: Ideas and First Examples

Jos A Hageman, Margriet M W B Hendriks, Age K Smilde, Johan A Westerhuis, Ruud Berger, Mariët J Van Der Werf

doi:10.1371/journal.pone.0003259

Open Access

https://doi.org/10.1371/journal.pone.0003259

Abstract

One of the new expanding areas in functional genomics is metabolomics: measuring the metabolome of an organism. The data generated in metabolomics studies are very diverse in nature, depending on the design underlying the experiment. Traditionally, variation in measurements is conceptually broken down into systematic variation and noise, where the latter contains, e.g., technical variation. There is increasing evidence that this distinction does not hold (or is too simple) for metabolomics data. A more useful distinction is in terms of informative and non-informative variation, where informative relates to the problem being studied. In most common methods for analyzing metabolomics (or any other high-dimensional x-omics) data this distinction is ignored, thereby severely hampering the results of the analysis. This leads to poorly interpretable models and may even obscure the relevant biological information. We developed a framework from first data analysis principles by explicitly formulating the problem of analyzing metabolomics data in terms of informative and non-informative parts. This framework allows for flexible interactions with the biologists involved in formulating prior knowledge of underlying structures. The basic idea is that the informative parts of the complex metabolomics data are approximated by simple components with a biological meaning, e.g. in terms of metabolic pathways or their regulation. Hence, we termed the framework ‘simplivariate models’, which constitutes a new way of looking at metabolomics data. The framework is given in its full generality and exemplified with two methods, IDR analysis and plaid modeling, that fit into the framework. Using this strategy of ‘divide and conquer’, we show that meaningful simplivariate models can be obtained using a real-life microbial metabolomics data set. For instance, one of the simple components contained all the measured intermediates of the Krebs cycle of E. coli. Moreover, these simplivariate models were able to uncover regulatory mechanisms present in the phenylalanine biosynthesis route of E. coli.
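
As a rough illustration of the idea of approximating the informative part of a data matrix by a simple component, the sketch below fits one additive layer (an overall level plus row and column effects, the form used for a layer in plaid-type models) to a block of a synthetic matrix. It is not the authors' implementation: the data are simulated, the function name `fit_additive_layer` is hypothetical, and the rows and columns belonging to the component are assumed to be known, whereas finding that membership is the actual task of methods such as plaid modeling.

```python
import numpy as np

def fit_additive_layer(X, rows, cols):
    """Least-squares fit of mu + alpha_i + beta_j on the block X[rows, cols].

    Hypothetical helper for illustration; membership (rows, cols) is assumed known.
    """
    sub = X[np.ix_(rows, cols)]
    mu = sub.mean()                  # overall level of the component
    alpha = sub.mean(axis=1) - mu    # row (sample) effects
    beta = sub.mean(axis=0) - mu     # column (metabolite) effects
    layer = np.zeros_like(X, dtype=float)
    layer[np.ix_(rows, cols)] = mu + alpha[:, None] + beta[None, :]
    return layer

# Synthetic data: samples in rows, metabolites in columns (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(scale=0.5, size=(20, 30))   # stands in for non-informative variation
rows = np.arange(5, 12)                    # samples in the planted component
cols = np.arange(3, 9)                     # metabolites in the planted component
X[np.ix_(rows, cols)] += 3.0               # planted "informative" simple component

layer = fit_additive_layer(X, rows, cols)
residual = X - layer                       # what the simple component leaves unexplained
explained = 1.0 - (residual ** 2).sum() / (X ** 2).sum()
print(f"fraction of sum of squares captured by the component: {explained:.2f}")
```

In this toy setting the matrix splits into a structured block and a residual; with real metabolomics data the residual would also contain any structure that the chosen simple components do not capture.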

Highlights

Modern instrumental methods have been generating a significant advancement in biology research
The type of data being generated in metabolomics studies is characterized by a very broad acquisition of semi-quantitative data of a large number of metabolites [1,2,3,4]
We propose a new conceptual framework for analyzing metabolomics data based on the idea to separate informative from non-informative variation

Summary

Introduction

Modern instrumental methods have generated significant advances in biological research. The missing link between these measurements and the phenotype is called metabolomics [1]. The data generated in metabolomics studies are characterized by a very broad acquisition of semi-quantitative data on a large number of metabolites [1,2,3,4]. This results in data sets of a very complex nature. Not only are these data sets high-dimensional, they also exhibit mixtures of types of variation introduced by the specific experimental setup [5]. Extensive details on the experimental setup, GC-MS and LC-MS analysis, and subsequent preprocessing can be found in [14].

Methods

Results

Conclusion