Evaluation of logistic regression models and effect of covariates for case–control study in RNA-Seq analysis

Seung Hoan Choi, Josée Dupuis, Anita L Destefano, Kathryn L Lunetta, Adam T Labadorf, Richard H Myers

doi:10.1186/s12859-017-1498-y

Abstract

Background: Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNA-Seq) data, its appropriateness has not been exhaustively evaluated. We explore logistic regression as an alternative method for RNA-Seq studies designed to compare cases and controls, where disease status is modeled as a function of RNA-Seq reads using simulated and Huntington disease data. We evaluate the effect of adjusting for covariates that have an unknown relationship with gene expression. Finally, we incorporate the data adaptive method in order to compare false positive rates.

Results: When the sample size is small or the expression levels of a gene are highly dispersed, the NB regression shows inflated Type-I error rates but the Classical logistic and Bayes logistic (BL) regressions are conservative. Firth’s logistic (FL) regression performs well or is slightly conservative. Large sample size and low dispersion generally make Type-I error rates of all methods close to nominal alpha levels of 0.05 and 0.01. However, Type-I error rates are controlled after applying the data adaptive method. The NB, BL, and FL regressions gain increased power with large sample size, large log2 fold-change, and low dispersion. The FL regression has comparable power to NB regression.

Conclusions: We conclude that implementing the data adaptive method appropriately controls Type-I error rates in RNA-Seq analysis. Firth’s logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework.

Highlights

Next generation sequencing provides a count of Ribonucleic Acid (RNA) molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements
These RNA sequencing (RNA-Seq) methods produce data that can be transformed into numerical values that are proportional to the abundance of RNA molecules and reflect the expression and turnover of those molecules
Large sample size and low dispersion generally yielded Type-I error rates that were close to the specified alpha levels as shown in Additional file 3: Figure S1

Summary

Introduction

Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNA-Seq) data, its appropriateness has not been exhaustively evaluated. Next generation sequencing (NGS) gene expression measurement methods simultaneously quantify tens of thousands of unique ribonucleic acid (RNA) molecules extracted from biological samples. These RNA-Seq methods produce data that can be transformed into numerical values that are proportional to the abundance of RNA molecules and reflect the expression and turnover of those molecules. The NB distribution appropriately models the biological dispersion of a gene, and NB regression has been used to analyze RNA-Seq data. When a random variable Y follows an NB distribution with mean μ and dispersion φ, the parameterization of the probability mass function, expected value, and variance of Y are
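The text breaks off before the formulas are given. As an illustration only, the sketch below uses a common mean/dispersion parameterization of the NB distribution, which is an assumption here and not confirmed by the text above: the size parameter is 1/φ, the expected value is μ, and the variance is μ + φμ². The function names `nb_pmf` and `nb_moments` are hypothetical helpers introduced for this example.

```python
import math


def nb_pmf(y, mu, phi):
    """P(Y = y) for an NB variable with mean mu and dispersion phi.

    Assumed parameterization (not taken from the source text):
    size r = 1/phi and success probability p = 1 / (1 + mu * phi).
    """
    r = 1.0 / phi
    p = 1.0 / (1.0 + mu * phi)
    # Work on the log scale so large counts do not overflow the gamma function.
    log_pmf = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p)
        + y * math.log1p(-p)
    )
    return math.exp(log_pmf)


def nb_moments(mu, phi, y_max=5000):
    """Numerically sum the pmf over 0..y_max to get total mass, mean, variance."""
    probs = [nb_pmf(y, mu, phi) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return total, mean, var


if __name__ == "__main__":
    total, mean, var = nb_moments(mu=10.0, phi=0.5)
    print(f"mass={total:.6f} mean={mean:.4f} variance={var:.4f}")
```

Under this assumed form, a gene with μ = 10 and φ = 0.5 has variance 10 + 0.5 · 10² = 60, six times what a Poisson model with the same mean would give, which shows how strongly the dispersion parameter drives the spread of the counts.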



Journal: BMC Bioinformatics	Publication Date: Feb 6, 2017
Citations: 22	License type: open-access

