Abstract
The effects of confounding factors on gene expression analysis have been extensively studied following the introduction of high-throughput microarrays and subsequently RNA sequencing. In contrast, there is a lack of equivalent analysis and tools for RNA splicing. Here we first assess the effect of confounders on both expression and splicing quantifications in two large public RNA-Seq datasets (TARGET, ENCODE). We show that quantifications of splicing variations are affected at least as much as those of gene expression, revealing unwanted sources of variation in both datasets. Next, we develop MOCCASIN, a method to correct the effect of both known and unknown confounders on RNA splicing quantification, and demonstrate MOCCASIN’s effectiveness on both synthetic and real data. Code, synthetic datasets, and corrected datasets are all made available as resources.
Highlights
The effects of confounding factors on gene expression analysis have been extensively studied following the introduction of high-throughput microarrays and subsequently RNA sequencing.
We set out to assess the effect of confounders on both expression and splicing analysis in two large and highly used datasets, TARGET (Therapeutically Applicable Research to Generate Effective Treatments initiative) and ENCODE[25], each with hundreds of samples.
The ENCODE dataset comprises 236 and 238 shRNA knockdown experiments performed in K562 and HepG2 cell lines, respectively, with many of the knockdown experiments targeting RNA binding proteins.
Summary
The effects of confounding factors on gene expression analysis have been extensively studied following the introduction of high-throughput microarrays and subsequently RNA sequencing. Common analyses include identifying samples that cluster together by gene expression or RNA splicing variations, and quantitative trait loci analysis to identify genetic variants associated with changes in expression (eQTL) or splicing (sQTL). The results of such analyses can be greatly affected by unwanted factors such as sequencing lane[1,2] or processing batch[3,4]. Many other methods focus on “local” splicing changes, quantifying expression at the exon level (e.g., DEXSeq[16]) or the relative usage of specific RNA segments or splice junctions. This latter approach, described in more detail below, involves quantifying the percent spliced in (PSI) for local splice variations (LSVs), or AS events. In contrast to the common usage of PSI quantification methods for the study of RNA splicing, there is a clear lack of tools for modeling known and unknown confounding factors in PSI-based splicing analysis. We were not able to find any previous work that quantitatively assessed the effect of confounders on splicing analysis and compared it to the effect on gene expression.
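To make the PSI quantity mentioned above concrete, the following is a minimal sketch that assumes PSI can be approximated as each junction's share of the reads within a single LSV. This naive ratio is an illustration only; it is not the quantification model used by MOCCASIN or by any specific splicing tool, and the function name `psi_from_counts` and the junction labels are hypothetical.

```python
from typing import Dict


def psi_from_counts(junction_reads: Dict[str, int]) -> Dict[str, float]:
    """Naive PSI: each junction's fraction of the reads in one LSV.

    Assumption: a plain read-count ratio, with no modeling of
    read-depth uncertainty or confounding factors.
    """
    total = sum(junction_reads.values())
    if total == 0:
        # No coverage at this LSV: PSI is undefined, not zero.
        return {junction: float("nan") for junction in junction_reads}
    return {junction: reads / total for junction, reads in junction_reads.items()}


# Hypothetical LSV with two junctions (exon inclusion vs. exon skipping).
lsv = {"exon_inclusion": 30, "exon_skipping": 10}
psi = psi_from_counts(lsv)
print(psi)
```

Because PSI values within an LSV are fractions that sum to one, they are bounded between 0 and 1, which is one reason confounder corrections built for unbounded expression values do not carry over directly to PSI-based analysis.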