Abstract

Extensive transcriptional activity in the intergenic regions of genomes has raised the question of whether intergenic transcription represents the activity of novel genes or noisy expression. To address this, we evaluated cross-species and post-duplication sequence and expression conservation of intergenic transcribed regions (ITRs) in four Poaceae species. Among 43,301 ITRs across the four species, 34,460 (80%) are species-specific. ITRs found across species tend to be more divergent in expression and have more recent duplicates than annotated genes. To assess whether ITRs are functional (under selection), machine learning models were established in Oryza sativa (rice) that could accurately distinguish between phenotype genes and pseudogenes (area under the receiver operating characteristic curve = 0.94). Based on the models, 584 (8%) and 4,391 (61%) rice ITRs are classified with high confidence as likely functional and likely nonfunctional, respectively. ITRs with conserved expression and ancient retained duplicates, features that were not part of the model, are frequently classified as likely functional, suggesting that these characteristics could serve as pragmatic rules of thumb for identifying candidate sequences likely to be under selection. This study also provides a framework for identifying novel genes using comparative transcriptomic data, improving the genome annotation that is fundamental for connecting genotype to phenotype in crop and model systems.

Highlights

  • Transcriptome sequencing has led to the discovery of pervasive transcription in unannotated, intergenic space in eukaryotes, including metazoan[1,2,3,4], fungal[5], and plant species[6,7,8,9,10]

  • To determine more definitively the relationship between the timing of duplication events and the retention of intergenic transcribed region (ITR) duplicates, we identified ITRs located in duplicated genome blocks derived from whole genome duplication (WGD) events

  • We investigated the cross-species and post-duplication evolutionary characteristics of intergenic transcribed regions (ITRs) in four grass species

Introduction

Transcriptome sequencing has led to the discovery of pervasive transcription in unannotated, intergenic space in eukaryotes, including metazoan[1,2,3,4], fungal[5], and plant species[6,7,8,9,10]. Instead of relying on a single line of evidence such as sequence conservation, an approach that integrates genetic, evolutionary, and biochemical evidence has been suggested[24]. Based on this framework, predictive models were established that were highly effective at identifying sequences with a significant fitness cost when mutated[25] and at distinguishing human and Arabidopsis thaliana protein-coding and RNA genes from pseudogenes[26,27]. Integrating evolutionary and biochemical signatures could therefore provide valuable insight for distinguishing functional ITRs from noisy ones. In this study, we investigate the extent of sequence and expression conservation, as well as the potential functionality, of ITRs using data from four Poaceae (grass) species: Oryza sativa (rice), Brachypodium distachyon, Sorghum bicolor (sorghum), and Zea mays (maize). Using rice as an example, we generated function prediction models by integrating rice mutant phenotype, sequence conservation, transcriptome, histone modification, DNA methylation, and nucleosome occupancy data to predict functional ITRs genome-wide.
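The area under the receiver operating characteristic curve used to evaluate such models can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy scores, and the 0.8 and 0.2 cutoffs are all hypothetical, chosen only to show how a score that separates a positive class (for example, phenotype genes) from a negative class (for example, pseudogenes) is summarized, and how a high-confidence call could be derived from that score.

```python
def auc_roc(scores, labels):
    """AUC-ROC computed as the fraction of (positive, negative) pairs in
    which the positive example gets the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def call_sequence(score, functional_cutoff=0.8, nonfunctional_cutoff=0.2):
    """Three-way call from a model score. The cutoffs are hypothetical
    placeholders, not values taken from the study."""
    if score >= functional_cutoff:
        return "likely functional"
    if score <= nonfunctional_cutoff:
        return "likely nonfunctional"
    return "ambiguous"


# Toy data (assumed for illustration): 1 = phenotype gene, 0 = pseudogene.
scores = [0.95, 0.80, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

auc = auc_roc(scores, labels)
calls = [call_sequence(s) for s in scores]
print(f"AUC-ROC = {auc:.4f}")
print(calls)
```

An AUC-ROC of 1.0 means every positive example outscores every negative one, and 0.5 corresponds to random ranking. Keeping an "ambiguous" class mirrors the idea that only sequences with scores near either extreme are called with high confidence.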
