Abstract

The Arabidopsis (Arabidopsis thaliana) genome is the most well-annotated plant genome. However, transcriptome sequencing in Arabidopsis continues to suggest the presence of polyadenylated (polyA) transcripts originating from presumed intergenic regions. It is not clear whether these transcripts represent novel noncoding or protein-coding genes. To understand the nature of intergenic polyA transcription, we first assessed its abundance using multiple messenger RNA sequencing data sets. We found 6,545 intergenic transcribed fragments (ITFs) occupying 3.6% of Arabidopsis intergenic space. In contrast to transcribed fragments that map to protein-coding and RNA genes, most ITFs are significantly shorter, are expressed at significantly lower levels, and tend to be more data set specific. A surprisingly large number of ITFs (32.1%) may be protein coding based on evidence of translation. However, our results indicate that these "translated" ITFs tend to be close to and are likely associated with known genes. To investigate if ITFs are under selection and are functional, we assessed ITF conservation through cross-species as well as within-species comparisons. Our analysis reveals that 237 ITFs, including 49 with translation evidence, are under strong selective constraint and relatively distant from annotated features. These ITFs are likely parts of novel genes. However, the selective pressure imposed on most ITFs is similar to that of randomly selected, untranscribed intergenic sequences. Our findings indicate that despite the prevalence of ITFs, apart from the possibility of genomic contamination, many may be background or noisy transcripts derived from "junk" DNA, whose production may be inherent to the process of transcription and which, on rare occasions, may act as catalysts for the creation of novel genes.

Highlights

  • Reliance on length cutoffs can result in longer random open reading frames (ORFs) being falsely annotated as protein coding and can lead to the exclusion of true small ORFs such as those that have been identified in yeast, humans, and Arabidopsis (Basrai et al, 1997; Hanada et al, 2007; Pruitt et al, 2007)

  • We investigated whether intergenic transcribed fragments (ITFs) are likely protein coding using (1) ribosome immunoprecipitation data generated in this study as well as public data sets (Jiao and Meyerowitz, 2010), (2) proteomics data (Baerenfaller et al, 2008; Castellana et al, 2008), and (3) fusion protein expression studies on selected targets
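
The point about length cutoffs can be illustrated with a minimal sketch. The code below is not the authors' pipeline: the function names, the forward-strand-only scan, and the 100-codon threshold are all assumptions chosen for illustration. It scans a DNA sequence for ATG-to-stop ORFs and shows how a fixed minimum length silently discards a short ORF that could still be translated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end, n_codons) for ATG-to-stop ORFs on the forward strand.

    n_codons counts coding codons and excludes the stop codon.
    Hypothetical helper for illustration; reverse strand is not scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, (i - start) // 3))
                start = None  # look for the next ORF in this frame
    return orfs


def filter_by_length(orfs, min_codons):
    """Keep only ORFs with at least min_codons coding codons."""
    return [orf for orf in orfs if orf[2] >= min_codons]


# A made-up 11-codon ORF: ATG, ten GCT codons, then a TAA stop.
small_orf_seq = "ATG" + "GCT" * 10 + "TAA"
all_orfs = find_orfs(small_orf_seq)

# Assumed cutoff of 100 codons: the small ORF is excluded.
kept = filter_by_length(all_orfs, min_codons=100)
```

With the assumed cutoff, `kept` is empty even though `all_orfs` contains the short ORF, which is why translation evidence such as ribosome association or proteomics is a more direct test of coding potential than length alone.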

Introduction

The Arabidopsis (Arabidopsis thaliana) genome is the most well-annotated plant genome. Studies in yeast (Saccharomyces cerevisiae; David et al, 2006), animals (Bertone et al, 2004; Carninci et al, 2005), and plants (Yamada et al, 2003; Li et al, 2007; Matsui et al, 2008) have revealed the presence of a large number of unannotated, novel transcripts. These novel transcripts may represent alternatively spliced forms of known genes (Filichkin et al, 2010), products of antisense (Yamada et al, 2003) or bidirectional transcription (Xu et al, 2009), retained introns (Ner-Gaon et al, 2004; Filichkin et al, 2010), transcript fusions (Ruan et al, 2007), or intergenic transcriptional units (referred to hereafter as intergenic transcribed fragments [ITFs]). The main question concerns the abundance of functional ITFs relative to those derived from noisy transcription.
