SQuIRE reveals locus-specific regulation of interspersed repeat expression.

Wan R Yang, Clarissa N Pacyna, Daniel Ardeljan, Kathleen H Burns, Lindsay M Payer

doi:10.1093/nar/gky1301

Abstract

Transposable elements (TEs) are interspersed repeat sequences that make up much of the human genome. Their expression has been implicated in development and disease. However, TE-derived RNA-seq reads are difficult to quantify. Past approaches have excluded these reads or aggregated RNA expression to subfamilies shared by similar TE copies, sacrificing quantitative accuracy or the genomic context necessary to understand the basis of TE transcription. As a result, the effects of TEs on gene expression and associated phenotypes are not well understood. Here, we present Software for Quantifying Interspersed Repeat Expression (SQuIRE), the first RNA-seq analysis pipeline that provides a quantitative and locus-specific picture of TE expression (https://github.com/wyang17/SQuIRE). SQuIRE is an accurate and user-friendly tool that can be used for a variety of species. We applied SQuIRE to RNA-seq from normal mouse tissues and a Drosophila model of amyotrophic lateral sclerosis. In both model organisms, we recapitulated previously reported TE subfamily expression levels and revealed locus-specific TE expression. We also identified differences in TE transcription patterns relating to transcript type, gene expression and RNA splicing that would be lost with other approaches using subfamily-level analyses. Altogether, our findings illustrate the importance of studying TE transcription with locus-level resolution.

Highlights

Further details of Count: Count uses a combination of SAMTools (Li et al. 2009), BEDTools (Quinlan and Hall 2010), awk and bash within a Python script to perform the algorithm described in the main text, in particular distinguishing uniquely aligning reads from multi-mapping reads.
Because quantitation in SQuIRE relies on uniquely aligning reads, SQuIRE needed to resolve three issues in identifying uniquely aligning reads and their mapped TE locations.
1) Because RepeatMasker annotation includes overlapping TE coordinates, a read can map uniquely to one genomic location that corresponds to two TE loci.
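
The overlapping-annotation case in item 1 can be illustrated with a minimal sketch. It only detects reads whose single alignment overlaps more than one annotated TE; it does not reproduce how SQuIRE resolves such reads, which is not described here. The `Interval` type, the `overlapping_tes` helper, the TE names and all coordinates are hypothetical, and intervals are assumed to be 0-based and half-open as in BED files.

```python
from collections import namedtuple

# Hypothetical record type: 0-based, half-open coordinates as in BED files.
Interval = namedtuple("Interval", ["chrom", "start", "end", "name"])


def overlapping_tes(read, tes):
    """Return the names of all TE annotations that overlap the read alignment."""
    return [
        te.name
        for te in tes
        if te.chrom == read.chrom and te.start < read.end and read.start < te.end
    ]


# Two made-up TE annotations whose coordinates overlap between 350 and 400.
annotations = [
    Interval("chr1", 100, 400, "TE_A"),
    Interval("chr1", 350, 600, "TE_B"),
]

# A read with a single (unique) alignment that falls in the overlapping region.
read = Interval("chr1", 360, 410, "read_1")

hits = overlapping_tes(read, annotations)
if len(hits) > 1:
    print(f"{read.name} aligns uniquely but overlaps {len(hits)} TE loci: {hits}")
```

A linear scan is enough for illustration; for genome-scale annotation, an interval-intersection tool such as BEDTools (which Count already uses) would be the practical choice.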

Summary

Introduction

Further details of Count: Count uses a combination of SAMTools (Li et al. 2009), BEDTools (Quinlan and Hall 2010), awk and bash within a Python script to perform the algorithm described in the main text, in particular distinguishing uniquely aligning reads from multi-mapping reads. It outputs bedgraphs of all reads (“multi”) and of only uniquely aligning reads (“unique”). If the RNA-seq data are stranded, it outputs unique and multi bedgraphs for each strand.
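
As a rough illustration of the unique versus multi distinction, the sketch below splits SAM alignment records into the two sets that would feed the “multi” (all reads) and “unique” tracks. It is not SQuIRE's code: it assumes the aligner writes the optional SAM `NH:i:` tag (number of reported alignments for the read), treats records without that tag as not unique, and ignores strand. The function names `nh_tag` and `split_tracks` and the example records are hypothetical.

```python
def nh_tag(sam_line):
    """Return the integer value of the NH:i: tag, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM alignment record.
    for field in fields[11:]:
        if field.startswith("NH:i:"):
            return int(field[len("NH:i:"):])
    return None


def split_tracks(sam_lines):
    """Group alignment records into 'multi' (all reads) and 'unique' (NH == 1)."""
    tracks = {"multi": [], "unique": []}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        tracks["multi"].append(line)
        # Assumption: a missing NH tag is treated conservatively as not unique.
        if nh_tag(line) == 1:
            tracks["unique"].append(line)
    return tracks


# Made-up records: one header line, one unique read, one multi-mapping read.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:2",
]

tracks = split_tracks(example)
print(len(tracks["multi"]), len(tracks["unique"]))
```

Each set could then be converted to coverage, for example with a genome-coverage tool, to produce the corresponding bedgraph; stranded data would need the same split repeated per strand.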

Results

Conclusion