Abstract

Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness of fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.

Highlights

  • In research areas such as RNA-sequencing (RNA-Seq) and microbiomics, sequencing technologies are applied to measure the composition of mixtures of nucleic acids [1, 2]

  • In this paper we propose a new statistical goodness of fit (GoF) test for the negative binomial (NB) distribution in regression models that are commonly used for analysing RNA-Seq and microbiome studies

  • Sequencing count data are often assumed to follow the NB or zero-inflated negative binomial (ZINB) distributions, which form the basis of several statistical procedures for testing for differential expression (RNA-Seq) or differential abundance

Summary

Sequence count data are poorly fit by the negative binomial distribution

Open Access
Citation: Hawinkel S, Rayner JCW, Bijnens L, Thas O (2020) Sequence count data are poorly fit by the negative binomial distribution.
Editor: Shailesh Kumar, National Institute of Plant Genome Research (NIPGR), India
Received: October 22, 2019

Introduction
Construction of the test statistic
Simulation study
Application to sequencing data
Conclusion and recommendation
Supporting information
Findings
Author Contributions