Improved annotation of the domestic pig genome through integration of Iso-Seq and RNA-seq data

H Beiki, N Manchanda, H Liu, J M Reecy, D Nonneman, C K Tuggle, J Huang, T P L Smith

doi:10.1186/s12864-019-5709-y

Abstract

Background: Our understanding of the pig transcriptome is limited. RNA transcript diversity among nine tissues was assessed using poly(A) selected single-molecule long-read isoform sequencing (Iso-seq) and Illumina RNA sequencing (RNA-seq) from a single White cross-bred pig.

Results: Across tissues, a total of 67,746 unique transcripts were observed, including 60.5% predicted protein-coding, 36.2% long non-coding RNA and 3.3% nonsense-mediated decay transcripts. On average, 90% of the splice junctions were supported by RNA-seq within tissue. A large proportion (80%) represented novel transcripts, mostly produced by known protein-coding genes (70%), while 17% corresponded to novel genes. On average, four transcripts per known gene (tpg) were identified; an increase over current EBI (1.9 tpg) and NCBI (2.9 tpg) annotations and closer to the number reported in the human genome (4.2 tpg). Our new pig genome annotation extended more than 6000 known gene borders (5′ end extension, 3′ end extension, or both) compared to EBI or NCBI annotations. We validated a large proportion of these extensions by independent pig poly(A) selected 3′-RNA-seq data, or human FANTOM5 Cap Analysis of Gene Expression data. Further, we detected 10,465 novel genes (81% non-coding) not reported in current pig genome annotations. More than 80% of these novel genes had transcripts detected in more than one tissue. In addition, more than 80% of novel intergenic genes with at least one transcript detected in liver tissue had H3K4me3 or H3K36me3 peaks mapping to their promoter and gene body, respectively, in independent liver chromatin immunoprecipitation data.

Conclusions: These validated results show significant improvement over current pig genome annotations.

Highlights

Our understanding of the pig transcriptome is limited.
To identify a more complete catalogue of transcript isoforms across porcine tissues, we processed poly(A) selected Pacific Biosciences (PacBio) isoform sequencing (Iso-seq) and Illumina RNA sequencing (RNA-seq) data from nine tissues. These data provided evidence to improve the annotation of thousands of protein-coding and long non-coding RNA genes, such that the complexity of the pig transcriptome is similar to that reported for the highly annotated human genome.
Using data from an independent liver chromatin immunoprecipitation (ChIP) sequencing experiment (Additional file 1: Table S3), we found that more than 80% (616) of the novel Ensembl and National Center for Biotechnology Information (NCBI) intergenic genes detected in liver tissue (694) had significant tri-methylation of lysine 4 on histone H3 (H3K4me3) that mapped to their promoters, i.e. the genomic region that spans from 500 base pairs (bp) 5′ to 100 bp 3′ of the gene's first exon (Fig. 4h, see illustrative examples in Fig. 5 and Additional file 1: Figure S8).
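The promoter definition in the last highlight can be illustrated with a minimal sketch. It assumes a literal reading of the text: the window runs from 500 bp upstream of the first exon's 5′ end to 100 bp downstream of its 3′ end, in transcript orientation. The names `Interval`, `promoter_window` and `overlaps`, the 1-based inclusive coordinates, and the example positions are hypothetical and are not taken from the paper's pipeline; only the counts 616 and 694 come from the text.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive


def promoter_window(chrom: str, exon_start: int, exon_end: int, strand: str,
                    upstream: int = 500, downstream: int = 100) -> Interval:
    """Window from `upstream` bp 5' to `downstream` bp 3' of the first exon.

    Assumption: the flanks are measured from the exon's two ends, so the
    exon itself is included. exon_start <= exon_end are genomic coordinates.
    """
    if strand == "+":
        start, end = exon_start - upstream, exon_end + downstream
    elif strand == "-":
        # On the minus strand the 5' end is the larger genomic coordinate.
        start, end = exon_start - downstream, exon_end + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return Interval(chrom, max(1, start), end)  # clamp at chromosome start


def overlaps(window: Interval, peak: Interval) -> bool:
    """True if a ChIP-seq peak shares at least one base with the window."""
    return (window.chrom == peak.chrom
            and window.start <= peak.end
            and peak.start <= window.end)


# Hypothetical first exon and H3K4me3 peak, for illustration only.
window = promoter_window("chr1", 1000, 1200, "+")
peak = Interval("chr1", 400, 650)
has_promoter_mark = overlaps(window, peak)

# Fraction reported in the text: 616 of 694 novel intergenic liver genes.
fraction_marked = 616 / 694
```

Under a different reading, in which both flanks are measured from the exon's 5′ end only, the window would be shorter; the text does not settle which convention was used.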

Summary

Introduction

Despite the value of pigs to agriculture, our understanding of the pig transcriptome is limited. RNA transcript diversity among nine tissues was assessed using poly(A) selected single-molecule long-read isoform sequencing (Iso-seq) and Illumina RNA sequencing (RNA-seq) from a single White cross-bred pig. The recent, long read-based update to the pig reference genome assembly was a major step forward for swine research. This genome assembly (Sscrofa11.1) was annotated both at the European Bioinformatics Institute (EBI) [4] and the National Center for Biotechnology Information (NCBI) [5]. Although these annotations represent a significant improvement over the previous pig genome annotation (Sscrofa10.2) [6], they are still far from complete. The numbers of annotated genes and transcripts per gene (tpg) in the current pig genome annotations (NCBI release 109: 30,334 genes and 2.9 tpg; Ensembl release 93: 25,880 genes and 1.9 tpg) are lower than those reported for the human genome.
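As a minimal sketch of how a transcripts-per-gene (tpg) figure like those quoted above can be derived, the snippet below averages transcript counts over genes from a transcript-to-gene mapping. The function name `transcripts_per_gene` and the toy identifiers are hypothetical; how EBI, NCBI or the authors actually compute tpg (for example, which gene biotypes are included) is not stated here and is not assumed.

```python
from collections import Counter
from typing import Mapping


def transcripts_per_gene(transcript_to_gene: Mapping[str, str]) -> float:
    """Mean number of transcripts per gene for a transcript -> gene mapping."""
    if not transcript_to_gene:
        raise ValueError("no transcripts supplied")
    # Count how many transcripts are assigned to each gene.
    per_gene = Counter(transcript_to_gene.values())
    return sum(per_gene.values()) / len(per_gene)


# Toy annotation (hypothetical IDs): three isoforms of geneA, one of geneB.
toy_annotation = {
    "tx1": "geneA",
    "tx2": "geneA",
    "tx3": "geneA",
    "tx4": "geneB",
}
toy_tpg = transcripts_per_gene(toy_annotation)  # 4 transcripts / 2 genes
```

In practice the mapping would be read from the `transcript_id` and `gene_id` attributes of an annotation file, which is left out to keep the sketch self-contained.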

Methods

Results

Discussion

Conclusion