MCMSeq: Bayesian hierarchical modeling of clustered and repeated measures RNA sequencing experiments

Brian E Vestal, Katerina Kechris, Camille M Moore, Elizabeth Wynn, Laura Saba, Tasha Fingerlin

doi:10.1186/s12859-020-03715-y

Abstract

Background: As the barriers to incorporating RNA sequencing (RNA-Seq) into biomedical studies continue to decrease, the complexity and size of RNA-Seq experiments are rapidly growing. Paired, longitudinal, and other correlated designs are becoming commonplace, and these studies offer immense potential for understanding how transcriptional changes within an individual over time differ depending on treatment or environmental conditions. While several methods have been proposed for dealing with repeated measures within RNA-Seq analyses, they are either restricted to handling only paired measurements, can only test for differences between two groups, and/or have issues with maintaining nominal false positive and false discovery rates. In this work, we propose a Bayesian hierarchical negative binomial generalized linear mixed model framework that can flexibly model RNA-Seq counts from studies with arbitrarily many repeated observations, can include covariates, and also maintains nominal false positive and false discovery rates in its posterior inference.

Results: In simulation studies, we showed that our proposed method (MCMSeq) best combines high statistical power (i.e. sensitivity or recall) with maintenance of nominal false positive and false discovery rates compared to the other available strategies, especially at the smaller sample sizes investigated. This behavior was then replicated in an application to real RNA-Seq data where MCMSeq was able to find previously reported genes associated with tuberculosis infection in a cohort with longitudinal measurements.

Conclusions: Failing to account for repeated measurements when analyzing RNA-Seq experiments can result in significantly inflated false positive and false discovery rates. Of the methods we investigated, whether they model RNA-Seq counts directly or work on transformed values, the Bayesian hierarchical model implemented in the mcmseq R package (available at https://github.com/stop-pre16/mcmseq) best combined sensitivity and nominal error rate control.

Highlights

As the barriers to incorporating RNA sequencing (RNA-Seq) into biomedical studies continue to decrease, the complexity and size of RNA-Seq experiments are rapidly growing
Simulation results, convergence: Across all sample sizes, the NB generalized linear mixed model (NBGLMM) and linear mixed model (LMM) methods had substantial proportions of models that failed to converge (Table 4), ranging from ≈ 4% to ≈ 20% of all genes tested at a given sample size
Number of significant genes: For brevity, we focus our discussion on the edgeR, edgeR*, limma, LMM, MCMSeq, NBGLMM and ShrinkBayes methods; full results for all methods can be found in the Supplementary materials

Summary

Introduction

As the barriers to incorporating RNA sequencing (RNA-Seq) into biomedical studies continue to decrease, the complexity and size of RNA-Seq experiments are rapidly growing. While several methods have been proposed for dealing with repeated measures within RNA-Seq analyses, they are either restricted to handling only paired measurements, can only test for differences between two groups, and/or have issues with maintaining nominal false positive and false discovery rates. As well-developed statistical methods and software tools are lacking, some have proposed analyzing correlated RNA-Seq data with DESeq or edgeR by including all levels of the clustering variable as fixed effects in the regression model [3, 5]. In this case, the number of parameters necessary to account for correlation is equal to the number of subjects or clusters. A method that can analyze RNA-Seq data in a flexible manner while including covariates and accounting for repeated measures is needed.
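The parameter-count point about the fixed-effects workaround can be illustrated with a small sketch. The Python code below is not taken from the paper or from the mcmseq package; it builds a hypothetical design matrix with one indicator column per subject plus time effects, and the function name, the balanced layout, and the choice of two time points are all assumptions made for illustration.

```python
import numpy as np


def fixed_effect_design(n_subjects, n_timepoints):
    """Hypothetical design matrix for a balanced repeated measures study.

    Every subject is observed at every time point. Correlation is handled
    by giving each subject its own indicator column (a fixed effect).
    """
    subject = np.repeat(np.arange(n_subjects), n_timepoints)
    time = np.tile(np.arange(n_timepoints), n_subjects)
    # One indicator column per subject; together they absorb the intercept.
    subject_cols = (subject[:, None] == np.arange(n_subjects)[None, :]).astype(float)
    # One indicator column per time point after baseline (time 0).
    time_cols = (time[:, None] == np.arange(1, n_timepoints)[None, :]).astype(float)
    return np.hstack([subject_cols, time_cols])


if __name__ == "__main__":
    for n_subjects in (3, 5, 10):
        X = fixed_effect_design(n_subjects, n_timepoints=2)
        n_obs, n_params = X.shape
        rank = np.linalg.matrix_rank(X)
        # Residual degrees of freedom left for estimating dispersion.
        print(
            f"subjects={n_subjects} observations={n_obs} "
            f"regression parameters={n_params} rank={rank} "
            f"residual df={n_obs - rank}"
        )
```

In this sketch the number of regression parameters grows by one with every added subject, whereas a random-intercept formulation would summarize between-subject variation with a single variance parameter. A further consequence of the fixed-effects layout is that any covariate that is constant within a subject (for example, a treatment group assigned once per subject) is a linear combination of the subject indicators, so its effect cannot be estimated separately.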

Methods

Results

Discussion

Conclusion