Abstract
Single cell RNA-seq data, like data from other sequencing technologies, contain systematic technical noise. Such noise results from the combined effect of unequal efficiencies in capturing and counting mRNA molecules, such as extraction/amplification efficiency and sequencing depth. We show that such technical effects are not only cell-specific but also affect genes differently, so a simple cell-wise size factor adjustment may not be sufficient. We present a non-linear normalization approach that provides a cell- and gene-specific normalization factor for each gene in each cell. We show that the proposed normalization method (implemented in the "SC2P" package) reduces more technical variation than competing methods, without reducing biological variation. When technical effects such as sequencing depth are not balanced between cell populations, SC2P normalization also removes the bias due to uneven technical noise. This method is applicable to scRNA-seq experiments that do not use unique molecular identifiers (UMIs) and thus retain amplification biases.
Highlights
Single Cell RNA-sequencing has become a widely applied tool to study the diverse and dynamic transcriptional activities among cell populations (Tang et al, 2009)
We use the alpha cells as an example to illustrate variation within a cell type. This data set is available at Gene Expression Omnibus (GEO) with accession number GSE86473
We present a normalization method that provides a cell- and gene-specific normalization factor that borrows information across genes and across cells
Summary
Single Cell RNA-sequencing (scRNA-seq) has become a widely applied tool to study the diverse and dynamic transcriptional activities among cell populations (Tang et al, 2009). Methods for data processing, including mapping short reads to the reference transcriptome and normalization to account for technical variability in the efficiency of RNA extraction, amplification and counting, have evolved alongside the sequencing technology. Normalization methods include simple size factors that adjust for global effects such as sequencing depth, for example counts per million (CPM) and reads per kilobase per million mapped reads (RPKM), which are widely used for their simplicity (Mortazavi et al, 2008), and the more data-adaptive trimmed mean of M values (TMM) (Robinson and Oshlack, 2010). Changes in the location, scale, or shape of the distribution are attributed to technical effects and removed in normalization (Robinson and Oshlack, 2010; Hansen et al, 2012). scRNA-seq data share many
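The global size-factor normalizations mentioned above (CPM and RPKM) can be illustrated with a minimal sketch. This is not the SC2P method; it only shows the simple cell-wise scaling that the abstract argues may be insufficient. The function names `cpm` and `rpkm`, the toy counts, and the gene lengths are hypothetical and chosen for illustration.

```python
def cpm(counts):
    """Counts per million for one cell: scale each gene count by 1e6 / total reads."""
    total = sum(counts)
    if total == 0:
        # An empty cell has no defined scaling; return zeros rather than divide by zero.
        return [0.0 for _ in counts]
    return [c * 1e6 / total for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: CPM further divided by gene length in kilobases."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


# Hypothetical toy data: cell_b has the same composition as cell_a
# but was sequenced at twice the depth.
cell_a = [10, 30, 60]
cell_b = [20, 60, 120]
gene_lengths_bp = [1000, 2000, 500]  # assumed gene lengths in base pairs

cpm_a = cpm(cell_a)
cpm_b = cpm(cell_b)
rpkm_a = rpkm(cell_a, gene_lengths_bp)
```

In this toy case the depth difference is a single multiplicative factor per cell, so CPM makes the two cells identical. A single per-cell factor of this kind cannot correct technical effects that vary from gene to gene within a cell, which is the situation the proposed cell- and gene-specific normalization targets.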