3'-end poly(A)+ sequencing is an efficient and economical method for global measurement of mRNA levels and alternative poly(A) site usage. A common method involves oligo(dT)19V reverse-transcription (RT)-based library preparation and high-throughput sequencing with a custom primer ending in (dT)19. For most library products, the first sequenced nucleotide reflects the bona fide poly(A) (pA) site, but a substantial fraction of sequencing reads arise from various mis-priming events. These can result in incorrect pA site calls anywhere from several nucleotides downstream to several kilobases upstream of the bona fide pA site. Although these mis-priming events can be mitigated by increasing annealing stringency (e.g., increasing temperature from 37 °C to 42 °C), they persist at an appreciable level (∼10%), and computational methods must be used to prevent artifactual calls. Here we present a bioinformatics workflow for precise mapping of poly(A)+ 3' ends and handling of artifacts due to oligo(dT) mis-priming and sample polymorphisms. We test pA site calling with three different read mapping programs (STAR, BWA, and BBMap) and show that the way in which each handles terminal mismatches and soft clipping has a substantial impact on identifying correct pA sites, with BWA requiring the least post-processing to correct artifacts. We demonstrate the use of this pipeline for mapping pA sites in the model eukaryote S. cerevisiae, and further apply this technology to non-polyadenylated transcripts by employing in vitro polyadenylation prior to library prep (IVP-seq). As proof of principle, we show that a fraction of tRNAs harbor CCU 3' tails instead of the canonical CCA tail, and we globally identify 3' ends of splicing intermediates arising from inefficiently spliced transcripts.
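As a rough illustration of the kind of computational check used against oligo(dT) mis-priming, the sketch below flags a candidate pA site when the genomic sequence immediately downstream of it is A-rich, since a genomically encoded A tract can anneal to oligo(dT) and mimic a poly(A) tail. This is a generic heuristic and not necessarily the filter used in the workflow described here; the function names, the 10-nt window, and the 0.7 threshold are illustrative assumptions, and the coordinates are assumed to be 0-based on the transcript's own strand.

```python
def downstream_a_fraction(genome_seq: str, pa_pos: int, window: int = 10) -> float:
    """Return the fraction of A bases in the `window` nucleotides that
    immediately follow the candidate pA site at 0-based index `pa_pos`.

    Assumption: `genome_seq` is already oriented 5'->3' on the transcript's
    strand, so "downstream" simply means higher string indices.
    """
    downstream = genome_seq[pa_pos + 1 : pa_pos + 1 + window].upper()
    if not downstream:
        # Site sits at the very end of the sequence; nothing to inspect.
        return 0.0
    return downstream.count("A") / len(downstream)


def is_possible_mispriming(
    genome_seq: str,
    pa_pos: int,
    window: int = 10,            # hypothetical window size
    min_a_fraction: float = 0.7,  # hypothetical threshold
) -> bool:
    """Flag a candidate pA site whose downstream genomic context is A-rich."""
    return downstream_a_fraction(genome_seq, pa_pos, window) >= min_a_fraction


# Toy sequences (not real yeast loci): candidate site at index 9 in both.
a_rich_context = "GATTACAGGC" + "AAAAAAAAGA" + "CGT"
mixed_context = "GATTACAGGC" + "TGCATGCATG" + "CGT"

flag_a_rich = is_possible_mispriming(a_rich_context, 9)
flag_mixed = is_possible_mispriming(mixed_context, 9)
```

In practice such a flag would be one input among several, for example alongside how the read mapper treated terminal mismatches and soft-clipped bases at the read end, rather than a sole criterion for discarding a site.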