Abstract

Background: Deregulated gene expression is a hallmark of cancer; however, most studies to date have analyzed short-read RNA sequencing data with inherent limitations. Here, we combine PacBio long-read isoform sequencing (Iso-Seq) and Illumina paired-end short-read RNA sequencing to comprehensively survey the transcriptome of gastric cancer (GC), a leading cause of global cancer mortality.

Results: We performed full-length transcriptome analysis across 10 GC cell lines covering four major GC molecular subtypes (chromosomal unstable, Epstein-Barr positive, genome stable and microsatellite unstable). We identify 60,239 non-redundant full-length transcripts, of which >66% are novel compared to current transcriptome databases. Novel isoforms are more likely to be cell line and subtype specific, expressed at lower levels with larger number of exons, with longer isoform/coding sequence lengths. Most novel isoforms utilize an alternate first exon, and compared to other alternative splicing categories, are expressed at higher levels and exhibit higher variability. Collectively, we observe alternate promoter usage in 25% of detected genes, with the majority (84.2%) of known/novel promoter pairs exhibiting potential changes in their coding sequences. Mapping these alternate promoters to TCGA GC samples, we identify several cancer-associated isoforms, including novel variants of oncogenes. Tumor-specific transcript isoforms tend to alter protein coding sequences to a larger extent than other isoforms. Analysis of outcome data suggests that novel isoforms may impart additional prognostic information.

Conclusions: Our results provide a rich resource of full-length transcriptome data for deeper studies of GC and other gastrointestinal malignancies.

Highlights

  • Deregulated gene expression is a hallmark of cancer; most studies to date have analyzed short-read RNA sequencing data with inherent limitations

  • Landscape of long-read full-length isoforms in gastric cancer (GC) cell lines: to obtain a representative overview of full-length transcripts in GC, we performed PacBio long-read RNA sequencing on ten GC cell lines

  • Benchmarking the novel isoforms against high-quality known isoforms (full-splice matches, FSMs), we found that novel in catalog (NIC) and novel not in catalog (NNC) isoforms exhibited comparable quality to known isoforms, while incomplete-splice matches (ISMs) exhibited a lower proportion of overlap with Cap Analysis of Gene Expression (CAGE) peaks

Introduction

Deregulated gene expression is a hallmark of cancer; most studies to date have analyzed short-read RNA sequencing data with inherent limitations. Conventional short-read RNA sequencing has been widely used to identify transcripts and gene expression changes in GC [4,5,6]. While this method has been effective in quantifying transcript abundance, short reads (usually 100 to 250 base pairs) rarely span full-length transcripts, which can often be several kilobases long, making it difficult to directly infer full-length transcript structure. These limitations are pronounced in complex human transcriptomes such as GC, which may express many distinct but very similar isoforms resulting from different alternative promoters, exons, and 3′ untranslated regions (UTRs) [7, 8]. The full-length transcriptome of GC has remained under-explored, despite the potential importance of this information for understanding the biological roles of alternative isoforms.
