Abstract

Background

In shotgun metagenomics, microbial communities are studied through direct sequencing of DNA without any prior cultivation. By comparing gene abundances estimated from the generated sequencing reads, functional differences between the communities can be identified. However, gene abundance data is affected by high levels of systematic variability, which can greatly reduce statistical power and introduce false positives. Normalization, the process by which systematic variability is identified and removed, is therefore a vital part of the data analysis. A wide range of normalization methods for high-dimensional count data has been proposed, but their performance in the analysis of shotgun metagenomic data has not been evaluated.

Results

Here, we present a systematic evaluation of nine normalization methods for gene abundance data. The methods were evaluated through resampling of three comprehensive datasets, creating a realistic setting that preserved the unique characteristics of metagenomic data. Performance was measured in terms of each method's ability to identify differentially abundant genes (DAGs), correctly calculate unbiased p-values, and control the false discovery rate (FDR). Our results showed that the choice of normalization method has a large impact on the end results. When the DAGs were asymmetrically distributed between the experimental conditions, many normalization methods had a reduced true positive rate (TPR) and a high false positive rate (FPR). Trimmed mean of M-values (TMM) and relative log expression (RLE) had the overall highest performance and are therefore recommended for the analysis of gene abundance data. For larger sample sizes, cumulative sum scaling (CSS) also showed satisfactory performance.

Conclusions

This study emphasizes the importance of selecting a suitable normalization method in the analysis of data from shotgun metagenomics. Our results also demonstrate that improper methods may result in unacceptably high levels of false positives, which in turn may lead to incorrect or obfuscated biological interpretation.
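Since TMM is one of the two recommended methods, a rough sketch of its core idea may be helpful: a per-sample scaling factor is derived from a trimmed mean of gene-wise log fold changes (M-values) against a reference sample. The Python sketch below is a simplified, unweighted illustration only; the function name, trimming defaults, and toy data are ours, and the edgeR implementation additionally applies precision weights and selects the reference sample automatically.

```python
import numpy as np

def tmm_factor(sample, ref, log_trim=0.3, abs_trim=0.05):
    """Simplified TMM scaling factor of `sample` relative to `ref`.

    Unweighted sketch of the trimmed-mean-of-M-values idea; the edgeR
    implementation additionally uses precision weights.
    """
    sample = np.asarray(sample, dtype=float)
    ref = np.asarray(ref, dtype=float)
    keep = (sample > 0) & (ref > 0)                 # drop genes absent in either sample
    p_s = sample[keep] / sample.sum()               # relative abundances
    p_r = ref[keep] / ref.sum()
    M = np.log2(p_s / p_r)                          # gene-wise log fold change
    A = 0.5 * np.log2(p_s * p_r)                    # gene-wise average log abundance
    lo_M, hi_M = np.quantile(M, [log_trim, 1 - log_trim])
    lo_A, hi_A = np.quantile(A, [abs_trim, 1 - abs_trim])
    trimmed = (M >= lo_M) & (M <= hi_M) & (A >= lo_A) & (A <= hi_A)
    return 2.0 ** M[trimmed].mean()                 # scaling factor for the library size

# A pure depth difference leaves relative abundances unchanged, so the factor is ~1.
rng = np.random.default_rng(0)
ref = rng.poisson(50, size=2000) + 1.0
sample = 3 * (rng.poisson(50, size=2000) + 1.0)
print(tmm_factor(sample, ref))                      # close to 1
```

In edgeR-style usage, the returned factor is multiplied into the raw library size to give an effective library size against which the counts are scaled.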

Highlights

  • In shotgun metagenomics, microbial communities are studied through direct sequencing of DNA without any prior cultivation

  • In this study, we compared the performance of nine normalization methods for shotgun metagenomic gene abundance data

  • Each artificial dataset was normalized using the nine different methods, and the ability to correctly identify differentially abundant genes (DAGs) was assessed (a minimal sketch of this scoring step follows the list). This set-up was used to investigate how the performance of the normalization methods changed under different characteristics of the data, such as group size, the proportion of DAGs, and their distribution between the two groups
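As a rough illustration of the final scoring step only, the sketch below computes TPR and FPR at a Benjamini-Hochberg FDR cut-off, assuming per-gene p-values from some differential-abundance test and a ground-truth vector marking the genes with known (simulated) effects. All names are illustrative and not taken from the study's own pipeline.

```python
import numpy as np
from scipy.stats import false_discovery_control    # Benjamini-Hochberg adjustment, SciPy >= 1.11

def score_dag_calls(p_values, is_dag, fdr=0.05):
    """True and false positive rates of DAG detection at an FDR cut-off.

    p_values : per-gene p-values from a differential-abundance test
    is_dag   : boolean array, True for genes with a known (simulated) effect
    """
    p_values = np.asarray(p_values, dtype=float)
    is_dag = np.asarray(is_dag, dtype=bool)
    called = false_discovery_control(p_values) < fdr    # BH-adjusted p-values below cut-off
    tpr = called[is_dag].mean()                          # fraction of true DAGs recovered
    fpr = called[~is_dag].mean()                         # fraction of non-DAGs falsely flagged
    return tpr, fpr
```

In a set-up like the one described above, this kind of scoring would be repeated for each normalization method and each resampled dataset before the results are compared.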



Introduction

In shotgun metagenomics, microbial communities are studied through direct sequencing of DNA without any prior cultivation. The resulting gene abundance data is affected by multiple sources of variability, which makes it notoriously hard to interpret [10,11,12]. One example of systematic variability is the difference in sequencing depth, where each sample is represented by a varying number of reads [13]. Systematic variability also arises from other technical sources, such as inconsistencies in DNA extraction and sample handling, varying quality between sequencing runs, errors in the read mapping, and incompleteness of the reference databases [14]. Normalization, the process by which this systematic variability is identified and removed, is therefore vital to achieve satisfactory statistical power and an acceptable false positive rate (FPR).
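To make the sequencing-depth example concrete, the sketch below shows one common way of estimating per-sample scaling factors from a genes-by-samples count matrix: the median-of-ratios approach underlying RLE. Dividing the counts by these factors removes a pure depth difference. The function name and toy counts are ours, and production implementations (e.g. in DESeq2 or edgeR) handle zeros and edge cases more carefully.

```python
import numpy as np

def rle_size_factors(counts):
    """Median-of-ratios (RLE-style) size factors for a genes x samples count matrix.

    Genes with a zero count in any sample are excluded from the reference profile,
    a simplification of what DESeq-style implementations do.
    """
    counts = np.asarray(counts, dtype=float)
    informative = np.all(counts > 0, axis=1)          # genes observed in every sample
    log_counts = np.log(counts[informative])
    log_geomean = log_counts.mean(axis=1)             # per-gene geometric mean (log scale)
    # A sample's size factor is its median ratio to the gene-wise reference profile.
    return np.exp(np.median(log_counts - log_geomean[:, None], axis=0))

# Toy example: the second sample is sequenced three times as deeply.
counts = np.array([[10, 30],
                   [20, 60],
                   [15, 45],
                   [ 8, 24]])
factors = rle_size_factors(counts)
print(factors)             # approximately [0.58, 1.73], a three-fold ratio
print(counts / factors)    # depth difference removed: the two columns now agree
```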
