Abstract

Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to high sequencing error rates and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) and a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results from the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data can yield meaningful results when long-read specific preprocessing methods are combined with short-read differential expression methods and software that are already in wide use.

Highlights

  • Short-read sequencing technology has underpinned transcriptomic profiling research over the past decade

  • Our differential gene expression (DGE) analysis uses a limma-voom workflow and shows that results from PCR-cDNA and direct-cDNA long reads are reliable: estimates are comparable to the known truth in the sequins synthetic control dataset and concordant with corresponding short-read studies

  • Although the total library size in the sequins dataset is lower than that of the neural stem cell (NSC) dataset, more reads were assigned per gene because the sequins dataset contains only a small set of genes, which improved power for DGE analysis
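
The last highlight rests on simple arithmetic: the average depth per gene depends on how many genes the reads are spread across, not only on library size. A minimal Python sketch of that relationship follows. The helper `mean_reads_per_gene` is hypothetical, reads are assumed to be spread evenly across genes (a simplification), and the library sizes and gene counts are made-up illustrative values, not figures from the study.

```python
def mean_reads_per_gene(library_size: int, n_genes: int) -> float:
    """Average reads per gene, assuming reads are spread evenly (a simplification)."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    return library_size / n_genes


# Illustrative numbers only; these are NOT the study's library sizes or gene counts.
small_panel = mean_reads_per_gene(library_size=200_000, n_genes=100)
whole_transcriptome = mean_reads_per_gene(library_size=1_000_000, n_genes=20_000)

# A smaller library can still give more reads per gene if it covers far fewer genes.
print(small_panel, whole_transcriptome)
print(small_panel > whole_transcriptome)
```

In this toy setting the smaller library yields the higher per-gene depth, which is the situation the highlight describes for the sequins dataset relative to the NSC dataset.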

Summary

Introduction

Short-read sequencing technology has underpinned transcriptomic profiling research over the past decade. The sequencing platforms offered by companies such as Illumina Inc. provide high read accuracy (>99.9%) and throughput, which allows many samples to be profiled in parallel. One major limitation of short-read sequencing technology is the modest read lengths offered (currently up to 600 bases), which makes accurate isoform quantification and novel isoform discovery challenging. Long-read sequencing offers a distinct advantage in this regard, with the ability to generate reads that are typically in the 1–100 kilobase (kb) range [1], which spans the typical length distribution of spliced genes in humans (for protein-coding genes, 1–3 kb is typical, with outliers such as Titin at >80 kb), thereby allowing the sequencing of entire isoforms. This advantage comes at the expense of lower throughput and reduced accuracy compared to short-read sequencing. The two main technology platforms that dominate the field of long-read sequencing are Pacific Biosciences’ (PacBio) Single-Molecule Real Time (SMRT) sequencing and Oxford Nanopore Technologies’ (ONT) nanopore sequencing.
