GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data.

Li Chen, Xuefeng Wang, Lujun Zhang, Shengbing Huang, Jun Chen, James Reeve

doi:10.7717/peerj.4600

Abstract

Normalization is the first critical step in microbiome sequencing data analysis, used to account for variable library sizes. Current RNA-Seq based normalization methods that have been adapted for microbiome data fail to consider the unique characteristics of microbiome data, which contain a vast number of zeros due to the physical absence or under-sampling of the microbes. Normalization methods that specifically address zero-inflation remain largely undeveloped. Here we propose geometric mean of pairwise ratios (GMPR), a simple but effective normalization method for zero-inflated sequencing data such as microbiome data. Simulation studies and analyses of real datasets demonstrate that the proposed method is more robust than competing methods, leading to more powerful detection of differentially abundant taxa and higher reproducibility of the relative abundances of taxa.

Highlights

High-throughput sequencing experiments such as RNA-Seq and microbiome sequencing are routinely employed to interrogate biological systems at the genome scale (Wang, Gerstein & Snyder, 2009).
Normalization is a critical step in processing microbiome data, rendering multiple samples comparable by removing the bias caused by variable sequencing depths.
Normalization paves the way for downstream analysis, especially differential abundance analysis (DAA) of microbiome data, where proper normalization can reduce the false positive rates caused by compositional effects.

Summary

Introduction

High-throughput sequencing experiments such as RNA-Seq and microbiome sequencing are routinely employed to interrogate biological systems at the genome scale (Wang, Gerstein & Snyder, 2009). After the raw sequence reads are processed, the sequencing data are usually presented as a count table of detected features. The complex processes involved in sequencing cause the sequencing depth (library size) to vary across samples, sometimes over several orders of magnitude. Normalization, which aims to correct or reduce the bias introduced by variable library sizes, is an essential preprocessing step before any downstream statistical analysis of high-throughput sequencing experiments (Dillies et al., 2013; Li et al., 2015). An inappropriate normalization method may either reduce statistical power with

Methods

Results

Conclusion