Choosing proper normalization is essential for discovery of sparse glycan biomarkers.

Hae-Won Uh, Lucija Klarić, Jeanine J Houwing-Duistermaat, Gordan Lauc, Age K Smilde, Ivo Ugrina

doi:10.1039/c9mo00174c

Abstract

Rapid progress in high-throughput glycomics analysis enables researchers to conduct large-sample studies. Typically, between-subject differences in the total abundance of raw glycomics data are very large, and these differences must be reduced to make measurements comparable across samples. There are essentially two ways to approach this issue: row-wise and column-wise normalization. In glycomics, the differences per subject are usually forced to be exactly zero by scaling each sample so that the sum of all its glycan intensities equals 100%. This total area (TA) normalization, a row-wise method, results in so-called compositional data, rendering many standard multivariate statistical methods inappropriate or inapplicable. Moreover, ignoring the compositional nature of the data may lead to spurious results. Alternatively, the raw data can be log-transformed prior to column-wise normalization, after which standard statistical tools can be applied. So far there is no clear consensus on the appropriate normalization method for glycomics data, nor is there a systematic investigation of the impact of TA on downstream analysis that would justify choosing it. Our motivation lies in efficient variable selection to identify glycan biomarkers, with regard to both accurate prediction and interpretability of the chosen model. Via extensive simulations we investigate how different normalization methods affect variable selection, and we compare their performance. We also address the effect of various types of measurement error in glycans: additive, multiplicative and two-component error. We show that when sample-wise differences are not large, row-wise normalization (like TA) can have deleterious effects on variable selection and prediction.
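The three measurement-error types named in the abstract can be illustrated with a minimal sketch. The abstract does not give the functional forms, so the ones below are assumptions based on commonly used definitions: additive error adds Gaussian noise to the true intensity, multiplicative error scales it by a log-normal factor, and two-component error combines both. The function name `add_error` and the parameters `sd_add` and `sd_mult` are hypothetical.

```python
import math
import random


def add_error(mu, kind, sd_add=1.0, sd_mult=0.1, rng=random):
    """Return a noisy measurement of the true intensity `mu`.

    Assumed error forms (not taken from the paper):
      additive:        y = mu + eps
      multiplicative:  y = mu * exp(eta)
      two-component:   y = mu * exp(eta) + eps
    with eps ~ N(0, sd_add^2) and eta ~ N(0, sd_mult^2).
    """
    eps = rng.gauss(0.0, sd_add)   # additive noise term
    eta = rng.gauss(0.0, sd_mult)  # noise on the log scale
    if kind == "additive":
        return mu + eps
    if kind == "multiplicative":
        return mu * math.exp(eta)
    if kind == "two-component":
        return mu * math.exp(eta) + eps
    raise ValueError(f"unknown error kind: {kind!r}")


# Hypothetical usage: perturb one true intensity under each error type.
rng = random.Random(1)
noisy = {kind: add_error(50.0, kind, rng=rng)
         for kind in ("additive", "multiplicative", "two-component")}
```

Under these assumed forms, additive noise dominates for low-abundance glycans and multiplicative noise dominates for high-abundance ones, which is why a log-transformation suits the multiplicative case but can behave poorly near zero.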

Highlights

In this work we systematically investigate the impact of different normalization methods on variable selection using lasso regression with glycan covariates.
It is often claimed that glycans are by their nature compositions, and that the percentage of a glycan species in the whole is biologically relevant information.[28]

Summary

Introduction

IgG glycosylation has been reported to be closely linked with both chronological and biological age.[2,3,4] Before the potential of glycan biomarkers of ageing can be explored, several steps of data pre-processing are required (Fig. 2). In this paper, pre-processing refers to the techniques used to extract clean data from raw instrumental data, and pre-treatment refers to methods that transform the cleaned raw data for downstream statistical analysis.[5] Normalization of the raw data is needed to bring glycomics measurements or abundances onto comparable scales, and improper normalization methods can significantly impair the data.[6,7] In this work we systematically investigate the impact of different normalization methods on variable selection using lasso regression with glycan covariates.
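The contrast between row-wise and column-wise normalization can be made concrete with a short sketch. It assumes a small made-up matrix of raw intensities (samples in rows, glycans in columns) and uses per-glycan centering and scaling as one possible column-wise method. This section of the source does not specify which column-wise method is used, so that choice, the function names and the data are all hypothetical.

```python
import math
from statistics import mean, stdev


def total_area_normalize(rows):
    """Row-wise (TA) normalization: scale each sample so its intensities sum to 100."""
    return [[100.0 * x / sum(row) for x in row] for row in rows]


def log_column_standardize(rows):
    """Log-transform raw intensities, then center and scale each glycan (column).

    Standardization is an assumed example of column-wise normalization.
    """
    logged = [[math.log(x) for x in row] for row in rows]
    columns = list(zip(*logged))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in logged]


# Hypothetical raw intensities: 3 samples (rows) x 3 glycans (columns).
raw = [
    [120.0, 30.0, 50.0],
    [240.0, 90.0, 70.0],
    [60.0, 20.0, 20.0],
]

ta = total_area_normalize(raw)   # every row now sums to 100 (percentages)
z = log_column_standardize(raw)  # every column now has mean 0 and unit variance
```

After TA normalization each sample is constrained to sum to 100%, so the values within a row are no longer free to vary independently; this is the compositional structure that complicates standard multivariate methods. The column-wise variant leaves between-sample differences in total abundance in the data and only puts the glycans on a common scale.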

Objectives

Methods

Findings

Conclusion