Abstract
It is apparent that non-coding transcripts are a common feature of higher organisms and encode uncharacterized layers of genetic regulation and information. We used public bovine EST data from many developmental stages and tissues, and developed a pipeline for the genome-wide identification and annotation of non-coding RNAs (ncRNAs). We predicted 23,060 bovine ncRNAs, 99% of which are unannotated in known ncRNA databases. Intergenic transcripts accounted for the majority (57%) of the predicted ncRNAs, and the occurrence of ncRNAs and genes was only moderately correlated (r = 0.55, p-value < 2.2e-16). Many of these intergenic ncRNAs mapped close to the 3′ or 5′ end of thousands of genes, and many were transcribed from the opposite strand with respect to the closest gene, particularly regulatory-related genes. Conservation analyses showed that these ncRNAs were evolutionarily conserved, and many intergenic ncRNAs proximate to genes contained sequence-specific motifs. Correlation analysis of expression between these intergenic ncRNAs and protein-coding genes, using RNA-seq data from a variety of tissues, showed significant correlations with many transcripts. These results support the hypothesis that ncRNAs are common, transcribed in a regulated fashion and have regulatory functions.
Highlights
As a result of advances in DNA sequencing technologies, a number of mammalian genomes have been sequenced and assembled
Development of the non-coding RNA (ncRNA) identification pipeline: we identified ncRNAs from bovine Expressed Sequence Tags (ESTs) by developing a computational pipeline based on public software and Perl scripts (Figure 1)
For most intergenic ncRNAs detected by the RNA-seq data (191 out of 389 at the 5′ end and 1,678 out of 2,673 at the 3′ end), we identified significantly associated protein-coding genes based on the MIC (Maximal Information Coefficient) score, with FDR ≤ 0.05 after multiple testing (Table S9). Many of these showed significant associations with multiple protein-coding genes in terms of their expression, with 35 of the 191 5′ end intergenic ncRNAs and 425 of the 1,678 3′ end intergenic ncRNAs correlated with their neighbour genes (Table S9). 78 of the 191 5′ end intergenic ncRNAs and 1,124 of the 1,678 3′ end intergenic ncRNAs were UTR-related RNAs
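The FDR threshold of 0.05 mentioned above can be illustrated with a minimal sketch of the Benjamini-Hochberg procedure. This is an assumption for illustration only: the text does not say which multiple-testing correction was used, nor how p-values were derived from the MIC scores. The function names and the ncRNA/gene pair labels below are hypothetical.

```python
from typing import List, Sequence, Tuple


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the rank decreases (monotonicity step).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_pairs(
    pairs: Sequence[Tuple[str, str]],
    p_values: Sequence[float],
    fdr: float = 0.05,
) -> List[Tuple[str, str]]:
    """Keep the (ncRNA, gene) pairs whose adjusted p-value is <= fdr."""
    q_values = benjamini_hochberg(p_values)
    return [pair for pair, q in zip(pairs, q_values) if q <= fdr]


# Hypothetical pair labels and p-values, not taken from the study.
example_pairs = [
    ("ncRNA_A", "GENE_1"),
    ("ncRNA_A", "GENE_2"),
    ("ncRNA_B", "GENE_3"),
    ("ncRNA_C", "GENE_4"),
]
example_p = [0.001, 0.8, 0.02, 0.5]
kept = significant_pairs(example_pairs, example_p)
```

In this toy input, only the pairs with raw p-values 0.001 and 0.02 remain after adjustment. A real analysis would more likely use an established implementation than a hand-written one.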
Summary
As a result of advances in DNA sequencing technologies, a number of mammalian genomes have been sequenced and assembled. While protein-coding genes are considered the most important elements of the genome, they account for only a small fraction of the genome sequence or the mammalian transcriptome. This indicates that the complexity of the mammalian genome, especially the transcriptome, cannot be interpreted merely according to the central dogma of molecular biology, "DNA-RNA-protein" [1,2,3,4,5]. Studies from the FANTOM consortium have confirmed that the majority of the mouse genome is transcribed, commonly from both strands. Most of these transcripts cannot be annotated as protein-coding RNAs [4]. These findings are evidence of a hidden, non-protein-coding transcriptome in mammals.