Abstract

Introduction

Human cytomegalovirus (HCMV) is a ubiquitous herpesvirus with a complex transcriptome. Polycistronism and alternative splicing make it particularly challenging to build accurate transcript models. Long-read sequencing is a powerful novel tool that can distinguish between isoforms and resolve a complex transcriptome. To gain better insight into the transcriptional repertoire of the virus, we sequenced the lytic HCMV transcriptome on multiple third-generation sequencing platforms. Our main objectives were to determine exon connectivity and to annotate the lytic transcriptome of the virus. To harness the power of long-read sequencing, we developed a pipeline that is suited to the analysis of long-read RNA sequencing data and can compare results obtained from different sequencing platforms. We also aimed to characterize the performance of each sequencing platform and library preparation method based on its ability to sequence full-length, genuine transcripts.

Materials and Methods

Two biologically independent samples were sequenced. The first sample was subjected to cDNA sequencing on the Pacific Biosciences (PacBio) RSII and Sequel platforms, as well as to cDNA and direct RNA (dRNA) sequencing on the Oxford Nanopore Technologies (ONT) MinION platform. The second sample was used for cap-selected cDNA sequencing on the MinION platform. The data were analysed with a custom pipeline built on the Biopython and pysam modules and the bedtools software. Custom scripts were written to generate read statistics, characterize transcripts and compare results.

Results

Over 80,000 cDNA reads were obtained from the two PacBio platforms and over 1,000,000 cDNA reads from the MinION platform. Direct RNA sequencing yielded 36,195 reads, which were used to validate the cDNA sequencing results.
We have created a pipeline for the analysis of long-read RNA sequencing data that accepts mapped reads produced by any long-read sequencing platform and outputs a transcriptome annotation based on those reads. We detected 440 isoforms in our dataset, 377 of which were novel. The novel transcripts include transcription start site (TSS) isoforms, transcription end site (TES) isoforms and alternatively spliced isoforms of known genes, as well as antisense transcripts and a novel intergenic transcript in the short repeat region. Many of the transcript isoforms differed from each other by only a few nucleotides. Interestingly, however, most isoforms differed from each other in the combination of ORFs that they contained.

Discussion

Our results have more than doubled the number of annotated HCMV transcripts. Cross-platform validation lends high confidence to these novel features. Using long-read RNA sequencing data, we were able to draw a more detailed map of the HCMV transcriptome, which is instrumental both for the analysis of viral gene expression and for understanding the molecular mechanisms of infection. Long-read RNA sequencing has revealed numerous novel isoforms in the organisms to which it has been applied. The biological function of most of these isoforms is currently unknown. However, our results show that many of the isoforms have distinct coding potentials, meaning that they code for different peptides or express upstream ORFs, which may play a regulatory role during translation. As long-read sequencing technologies advance, bioinformatics tools that can analyse such data are becoming increasingly important. We developed a pipeline that can rapidly process long-read RNA sequencing data from different platforms and create a transcriptome annotation, and that can be used by researchers with no bioinformatics background.
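
To illustrate the general idea of turning mapped long reads into an annotation, the following minimal Python sketch collapses reads into candidate isoforms. It is not the pipeline described above: the `Read` record, the `collapse_reads` function, the 10-nt end tolerance and the minimum support of two reads are all assumptions made for this example. Reads are grouped when they share strand and intron chain and when their ends fall into the same coordinate bin; binning is a simplification, since two ends that lie close together but straddle a bin boundary are kept apart.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical minimal record for one mapped long read."""
    strand: str                          # '+' or '-'
    start: int                           # leftmost aligned position
    end: int                             # rightmost aligned position
    introns: Tuple[Tuple[int, int], ...]  # (start, end) of each spliced-out gap


def isoform_key(read: Read, tolerance: int) -> tuple:
    """Reads with the same key are treated as the same candidate isoform.

    End positions are binned so that small differences in read ends are
    tolerated; the intron chain must match exactly.
    """
    return (read.strand, read.start // tolerance,
            read.end // tolerance, read.introns)


def collapse_reads(reads: Iterable[Read], tolerance: int = 10,
                   min_support: int = 2) -> Dict[tuple, int]:
    """Count reads per candidate isoform and keep those with enough support."""
    counts = Counter(isoform_key(r, tolerance) for r in reads)
    return {key: n for key, n in counts.items() if n >= min_support}


# Made-up coordinates, for illustration only.
example_reads = [
    Read('+', 100, 2000, ((500, 800),)),
    Read('+', 103, 2004, ((500, 800),)),  # same isoform, slightly shifted ends
    Read('+', 100, 2000, ()),             # unspliced, seen only once
]

for key, support in collapse_reads(example_reads).items():
    print(key, support)
```

In practice the `Read` records would be filled from alignments in a BAM file, and requiring support from more than one read, or from more than one platform, is one simple way to express the kind of cross-validation mentioned in the text.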
