Strategies for cellular deconvolution in human brain RNA sequencing data

Olukayode A Sosina, Bryan C Quach, Stephen A Semick, Andrew E Jaffe, Jeffrey T Leek, Kristen R Maynard, Joel E Kleinman, Ran Tao, Margaret A Taub, Keri Martinowich, Thomas Hyde, Daniel R Weinberger, Dana B Hancock, Matthew N Tran

doi:10.12688/f1000research.50858.1

Abstract

Background: Statistical deconvolution strategies have emerged over the past decade to estimate the proportion of various cell populations in homogenate tissue sources like brain using gene expression data. However, no study has been undertaken to assess the extent to which expression-based and DNAm-based cell type composition estimates agree.

Results: Using estimated neuronal fractions from DNAm data, from the same brain region (i.e., matched) as our bulk RNA-Seq dataset, as proxies for the true unobserved cell-type fractions (i.e., as the gold standard), we assessed the accuracy (RMSE) and concordance (R2) of four reference-based deconvolution algorithms: Houseman, CIBERSORT, non-negative least squares (NNLS)/MIND, and MuSiC. We did this for two cell-type populations - neurons and non-neurons/glia - using matched single nuclei RNA-Seq and mismatched single cell RNA-Seq reference datasets. With the mismatched single cell RNA-Seq reference dataset, Houseman, MuSiC, and NNLS produced concordant (high correlation; Houseman R2 = 0.51, 95% CI [0.39, 0.65]; MuSiC R2 = 0.56, 95% CI [0.43, 0.69]; NNLS R2 = 0.54, 95% CI [0.32, 0.68]) but biased (high RMSE, >0.35) neuronal fraction estimates. CIBERSORT produced more discordant (moderate correlation; R2 = 0.25, 95% CI [0.15, 0.38]) neuronal fraction estimates, but with less bias (low RMSE, 0.09). Using the matched single nuclei RNA-Seq reference dataset did not eliminate bias (MuSiC RMSE = 0.17).

Conclusions: Our results together suggest that many existing RNA deconvolution algorithms estimate the RNA composition of homogenate tissue, e.g. the amount of RNA attributable to each cell type, and not the cellular composition, which relates to the underlying fraction of cells.
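The difference between RNA composition and cellular composition can be illustrated with a toy two-cell-type mixture. The sketch below is not one of the four benchmarked algorithms: it uses a simple sum-to-one least-squares fit for two components, and every number in it (the three-gene profiles, the 30% neuronal cell fraction, and the assumption that a neuron carries three times the RNA of a glial cell) is hypothetical and chosen only for illustration.

```python
# Illustrative only: all values are made up, not taken from the study.

def rna_fraction(cell_fraction, rna_per_neuron, rna_per_glia):
    """Share of total RNA contributed by neurons, given the neuronal cell fraction."""
    neuron_rna = cell_fraction * rna_per_neuron
    glia_rna = (1.0 - cell_fraction) * rna_per_glia
    return neuron_rna / (neuron_rna + glia_rna)


def mix_bulk(cell_fraction, neuron_profile, glia_profile, rna_per_neuron, rna_per_glia):
    """Simulate a noise-free, library-size-normalised bulk expression profile."""
    w = rna_fraction(cell_fraction, rna_per_neuron, rna_per_glia)
    return [w * n + (1.0 - w) * g for n, g in zip(neuron_profile, glia_profile)]


def deconvolve_two_types(bulk, neuron_profile, glia_profile):
    """Least-squares p in bulk ~ p*neuron + (1-p)*glia, clipped to [0, 1]."""
    num = sum((b - g) * (n - g) for b, n, g in zip(bulk, neuron_profile, glia_profile))
    den = sum((n - g) ** 2 for n, g in zip(neuron_profile, glia_profile))
    return min(1.0, max(0.0, num / den))


# Hypothetical normalised reference profiles over three marker genes.
neuron_profile = [0.6, 0.3, 0.1]
glia_profile = [0.1, 0.3, 0.6]

true_cell_fraction = 0.30  # assumed fraction of cells that are neurons
bulk = mix_bulk(true_cell_fraction, neuron_profile, glia_profile,
                rna_per_neuron=3.0, rna_per_glia=1.0)
estimate = deconvolve_two_types(bulk, neuron_profile, glia_profile)

print(f"true cell fraction:   {true_cell_fraction:.4f}")
print(f"implied RNA fraction: {rna_fraction(true_cell_fraction, 3.0, 1.0):.4f}")
print(f"deconvolved estimate: {estimate:.4f}")
```

Under these assumptions the fit recovers the neuronal share of RNA (about 0.56) rather than the share of cells (0.30), which is the kind of systematic offset the conclusions describe. When both cell types are given equal RNA per cell, the two quantities coincide.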

Highlights

Homogenate tissues like brain and blood contain a mixture of cell types which can each have unique genomic profiles, and these mixtures of cell types, termed “cellular composition”, can vary across samples (Jaffe and Irizarry 2014)
Price et al 2019); here we found very high correlation (ρ = -0.949, Figure S1, Extended data (Sosina et al 2021)) between the neuronal fraction and the first principal component (PC) of the entire DNA methylation (DNAm) profile (32.3% of variance explained), which we have shown to be an accurate surrogate of composition in frontal cortex (Jaffe et al 2016) and blood (Jaffe and Irizarry 2014)
Statistical deconvolution strategies have emerged over the past decade to estimate the proportion of various cell populations in homogenate tissue sources like blood and brain from both gene expression and DNAm data

Summary

Introduction

Homogenate tissues like brain and blood contain a mixture of cell types which can each have unique genomic profiles, and these mixtures of cell types, termed “cellular composition”, can vary across samples (Jaffe and Irizarry 2014). Results: Using estimated neuronal fractions from DNAm data, from the same brain region (i.e., matched) as our bulk RNA-Seq dataset, as proxies for the true unobserved cell-type fractions (i.e., as the gold standard), we assessed the accuracy (RMSE) and concordance (R2) of four reference-based deconvolution algorithms: Houseman, CIBERSORT, non-negative least squares (NNLS)/MIND, and MuSiC. We did this for two cell-type populations - neurons and non-neurons/glia - using matched single nuclei RNA-Seq and mismatched single cell RNA-Seq reference datasets.
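The two evaluation metrics named above measure different things, and a short sketch may make that concrete. It assumes R2 means the squared Pearson correlation between estimated and reference fractions (the text describes it as concordance/correlation but does not give a formula), and the fraction values are invented for illustration rather than drawn from the study.

```python
import math


def rmse(estimates, reference):
    """Root mean squared error between estimated and reference fractions."""
    n = len(reference)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimates, reference)) / n)


def r_squared(estimates, reference):
    """Squared Pearson correlation (assumed definition of R2 here)."""
    n = len(reference)
    mean_e = sum(estimates) / n
    mean_r = sum(reference) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimates, reference))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov * cov / (var_e * var_r)


# Hypothetical reference (e.g. DNAm-based) neuronal fractions for four samples.
reference = [0.20, 0.30, 0.40, 0.50]

# Estimator A: tracks the reference perfectly but is shifted upward by 0.35.
shifted = [r + 0.35 for r in reference]

# Estimator B: centred on the reference but follows it less closely.
scattered = [0.25, 0.25, 0.45, 0.45]

for name, est in [("shifted", shifted), ("scattered", scattered)]:
    print(f"{name:9s} RMSE = {rmse(est, reference):.3f}  R2 = {r_squared(est, reference):.3f}")
```

In this toy case the shifted estimator has perfect correlation with a large RMSE, while the scattered one has a lower correlation and a much smaller RMSE, mirroring the "concordant but biased" versus "discordant but less biased" pattern reported for the algorithms.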

Methods

Results

Conclusion

Journal: F1000Research	Publication Date: Aug 4, 2021
Citations: 5	License type: CC BY 4.0
