Sequence count data are poorly fit by the negative binomial distribution.

Stijn Hawinkel, J C W Rayner, Olivier Thas, Luc Bijnens

doi:10.1371/journal.pone.0224909

Abstract

Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness of fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.
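To make the idea of checking the NB-assumption for a single feature concrete, the following is a minimal sketch and not the goodness of fit test proposed in the paper. It assumes the common mean-dispersion parameterisation of the NB distribution, with variance equal to mu + phi * mu^2, fits it by the method of moments without covariates, and compares the observed fraction of zeros with the fraction the fitted NB predicts, using a simple plug-in parametric bootstrap. The function names (`nb_moment_fit`, `zero_fraction_check`) and the choice of the zero fraction as the diagnostic are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def nb_moment_fit(counts):
    """Method-of-moments NB fit; returns (mu, phi) with Var = mu + phi * mu**2."""
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    var = counts.var(ddof=1)
    # Clip at a tiny positive value so the NB stays defined when var <= mu.
    phi = max((var - mu) / mu**2, 1e-8)
    return mu, phi


def zero_fraction_check(counts, n_boot=500, seed=0):
    """Compare the observed zero fraction with the one implied by a fitted NB.

    Returns (observed, expected, p_value). The p-value comes from a plug-in
    parametric bootstrap: parameters are NOT refitted on each simulated
    sample, which is a simplification made for brevity.
    """
    counts = np.asarray(counts)
    mu, phi = nb_moment_fit(counts)
    size = 1.0 / phi              # NB "size" parameter r
    prob = size / (size + mu)     # success probability in NumPy/SciPy convention
    observed = np.mean(counts == 0)
    expected = stats.nbinom.pmf(0, size, prob)

    # Simulate datasets of the same length from the fitted NB.
    rng = np.random.default_rng(seed)
    sims = rng.negative_binomial(size, prob, size=(n_boot, counts.size))
    sim_zero = (sims == 0).mean(axis=1)

    # Two-sided bootstrap p-value for the deviation in zero fraction.
    extreme = np.abs(sim_zero - expected) >= abs(observed - expected)
    p_value = (1 + extreme.sum()) / (n_boot + 1)
    return observed, expected, p_value
```

Applied feature by feature, a check of this kind flags counts with more or fewer zeros than a fitted NB would produce. It ignores covariates and multiple testing, both of which a regression-based test such as the one described in the paper would need to handle.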

Highlights

In research areas such as RNA-sequencing (RNA-Seq) and microbiomics, sequencing technologies are applied to measure the composition of mixtures of nucleic acids [1, 2].
In this paper we propose a new statistical goodness of fit (GoF) test for the negative binomial (NB) distribution in regression models that are commonly used for analysing RNA-Seq and microbiome studies.
Sequencing count data are often assumed to follow the NB or zero-inflated negative binomial (ZINB) distributions, which form the basis of several statistical procedures for testing for differential expression (RNA-Seq) or differential abundance.

Summary

Open access

Citation: Hawinkel S, Rayner JCW, Bijnens L, Thas O (2020) Sequence count data are poorly fit by the negative binomial distribution.

Editor: Shailesh Kumar, National Institute of Plant Genome Research (NIPGR), India

Received: October 22, 2019

Introduction

Construction of the test statistic

Simulation study

Application to sequencing data

Conclusion and recommendation

Supporting information

Findings

Author Contributions

Journal: PLOS ONE	Publication Date: Apr 30, 2020
Citations: 34	License type: CC BY 4.0
