Abstract

Pooling metabolomics data across studies is often desirable to increase the statistical power of the analysis. However, this can raise methodological challenges, as several preanalytical and analytical factors could introduce differences in measured concentrations and variability between datasets. Specifically, different studies may use different sample types (e.g., serum versus plasma) that are collected, treated, and stored according to different protocols, and assayed in different laboratories using different instruments. To address these issues, a new pipeline was developed to normalize and pool metabolomics data through a set of sequential steps: (i) exclusion of the least informative observations and metabolites, removal of outliers, and imputation of missing data; (ii) identification of the main sources of variability through principal component partial R-square (PC-PR2) analysis; (iii) application of linear mixed models to remove unwanted variability, including samples’ originating study and batch, and to preserve biological variation while accounting for potential differences in the residual variances across studies. This pipeline was applied to targeted metabolomics data acquired using Biocrates AbsoluteIDQ kits in eight case-control studies nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Comprehensive examination of the metabolomics measurements indicated that the pipeline improved the comparability of data across the studies. Our pipeline can be adapted to normalize other molecular data, including biomarkers as well as proteomics data, and could be used for pooling molecular datasets, for example in international consortia, to limit biases introduced by inter-study variability. This versatility makes our work of potential interest to molecular epidemiologists.
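As a rough illustration of what removing study-level variability means in step (iii), the sketch below centers one metabolite's log-concentrations within each study. This is a deliberately simplified stand-in and not the linear mixed model used in the pipeline: it treats study as the only unwanted factor, does not adjust for biological covariates, and ignores differences in residual variance across studies. The function name `center_by_study` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_study(values, studies):
    """Shift each study's mean to the grand mean for one metabolite.

    Simplified stand-in for the mixed-model step: only per-study mean
    offsets are removed; within-study differences are left untouched.
    """
    grouped = defaultdict(list)
    for value, study in zip(values, studies):
        grouped[study].append(value)
    study_means = {study: mean(vals) for study, vals in grouped.items()}
    grand_mean = mean(values)
    return [v - study_means[s] + grand_mean for v, s in zip(values, studies)]


# Hypothetical toy data: study B is systematically offset from study A.
log_conc = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
study = ["A", "A", "A", "B", "B", "B"]
normalized = center_by_study(log_conc, study)
```

After centering, both studies share the same mean while the spread within each study is unchanged. A mixed model would additionally shrink the study effects and could keep biological covariates of interest in the model so that their signal is not removed.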

Highlights

  • Metabolomics is a powerful tool for investigating candidate etiological pathways for chronic diseases [1,2,3,4]

  • If the ultimate objective of the study is to identify metabolites associated with, say, alcohol, while controlling for body mass index (BMI), alcohol should be included in matrix Z, but BMI could be included in matrix X, so that the associations are adjusted for BMI

  • Incident cancer cases were identified through a combination of methods, including linkage to health insurance records and to cancer and pathology registries, and active follow-up through study participants and their next-of-kin [23]

Summary

Introduction

Metabolomics is a powerful tool for investigating candidate etiological pathways for chronic diseases [1,2,3,4]. Sample types (e.g., serum versus plasma), fasting status of the participant, and any other elements related to sampling conditions, sample treatment, and storage represent preanalytical factors, while analytical factors include information on the organization of samples in batches, the acquisition instrument, the acquisition time (i.e., time at which the sample was assayed), and the laboratory [17]. Correcting for these sources of variation is crucial to conduct accurate statistical analyses on pooled datasets.
