Abstract

Computational analysis of promoters is hindered by the complexity of their architecture. In less studied genomes with complex organization, false positive promoter predictions are common. Accurate identification of transcription start sites (TSS) and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that models driven by probabilistic integrative algorithms allow accurate classification of DNA sequences into “promoters” and “non-promoters” even in the absence of full-length cDNA sequences. These models may be built upon maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, and transcription factor binding sites, as well as the relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors: those that bind preferentially to the [-500,0] region (188 “promoter-specific” transcription factors), those that bind preferentially to the [0,500] region (282 “5′ UTR-specific” transcription factors), and 207 “promiscuous” transcription factors with little or no location preference with respect to the TSS. For the most informative motifs, positional preferences are conserved between dicots and monocots.
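The three-way split of transcription factors by binding location can be illustrated with a minimal sketch. It does not reproduce the positional clustering used in the paper: it swaps in a simple fraction-based rule, and the function name `classify_tf`, the `min_fraction` threshold of 0.7, and the choice to count a site at position 0 as downstream are all assumptions made for illustration only.

```python
def classify_tf(site_positions, min_fraction=0.7):
    """Assign a transcription factor to a positional class.

    site_positions: binding-site positions relative to the TSS (TSS = 0).
    min_fraction:   hypothetical threshold, not taken from the paper.
    Returns None when no site falls inside [-500, 500].
    """
    # Sites in [-500, 0) count as upstream; sites in [0, 500] as downstream.
    # Placing position 0 downstream is an arbitrary illustrative choice.
    upstream = sum(1 for p in site_positions if -500 <= p < 0)
    downstream = sum(1 for p in site_positions if 0 <= p <= 500)
    total = upstream + downstream
    if total == 0:
        return None
    if upstream / total >= min_fraction:
        return "promoter-specific"
    if downstream / total >= min_fraction:
        return "5' UTR-specific"
    return "promiscuous"
```

With this rule, a factor whose sites mostly fall upstream of the TSS is labeled "promoter-specific", one whose sites mostly fall downstream is labeled "5' UTR-specific", and anything in between is labeled "promiscuous". A real analysis would replace the fixed threshold with a statistical test of positional enrichment.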

Highlights

  • Core promoters are the 5′ regions adjacent to the transcriptional start site (TSS) that contain binding sites for transcription factors (TFBS)

  • We present a comprehensive analysis of genomic features associated with promoters and show that models driven by probabilistic integrative algorithms allow accurate classification of DNA sequences into “promoters” and “non-promoters” even in the absence of full-length cDNA sequences

  • For every gene in both models, we extracted a 1,000 nt long sequence centered at the TSS, and calculated distributions of genomic features previously associated with the start of transcription: (1) frequency of dinucleotide CA [1, 30, 31]; (2) frequency of TATA [1, 4, 32]; (3) nucleotide consensus around TSS [12, 13, 33]; (4)
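The window extraction and the first two features listed above (CA dinucleotide frequency and TATA occurrences) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function names, the 0-based TSS coordinate convention, and the strand handling are assumptions.

```python
# Complement table for reverse-complementing minus-strand windows.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tss_window(chrom_seq, tss, strand="+", flank=500):
    """Return a 2*flank nt window in which the TSS sits at index `flank`.

    `tss` is assumed to be a 0-based coordinate on the forward strand.
    Returns None if the window would run off either end of the sequence.
    """
    if strand == "+":
        start, end = tss - flank, tss + flank
    else:
        # Shift by one so the TSS lands at index `flank` after
        # reverse-complementing.
        start, end = tss - flank + 1, tss + flank + 1
    if start < 0 or end > len(chrom_seq):
        return None
    window = chrom_seq[start:end].upper()
    return reverse_complement(window) if strand == "-" else window


def dinucleotide_frequency(window, dinuc="CA"):
    """Fraction of overlapping dinucleotide positions equal to `dinuc`."""
    total = len(window) - 1
    if total <= 0:
        return 0.0
    count = sum(1 for i in range(total) if window[i:i + 2] == dinuc)
    return count / total


def motif_positions(window, motif="TATA", flank=500):
    """Start positions of overlapping motif hits, relative to the TSS."""
    hits = []
    i = window.find(motif)
    while i != -1:
        hits.append(i - flank)
        i = window.find(motif, i + 1)
    return hits
```

Applied to every annotated gene, `motif_positions` gives the raw positions from which a positional distribution around the TSS can be tabulated, and `dinucleotide_frequency` gives one number per window. Exact string matching for TATA is a simplification; a position weight matrix would tolerate variants of the motif.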


Summary

Introduction

Core promoters are the 5′ regions adjacent to the transcriptional start site (TSS) that contain binding sites for transcription factors (TFBS). Computational analysis of eukaryotic promoters is hindered by their complex architecture [1,2,3]. Each gene contains one or more TSS and, correspondingly, one or more promoters that initiate its transcription. From 30% to 60% of eukaryotic genes contain the TATA motif approximately.

Supported by the NSF Division of Environmental Biology (1456634). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.

