Abstract

It has long been known that exons can serve as cis‐regulatory sequences, such as enhancers. However, the prevalence of such dual use of exons and how they evolve remain elusive. Based on our recently predicted, highly accurate large sets of cis‐regulatory module candidates (CRMCs) and non‐CRMCs in the human genome, we find that exonic transcription factor binding sites (TFBSs) occupy at least a third of the total exon length, and 96.7% of genes have exonic TFBSs. Both A/T and C/G positions in exonic TFBSs are more likely to be under evolutionary constraints than those in non‐CRMC exons. Exonic TFBSs in codons tend to encode loops rather than the more critical helices and strands in protein structures, while exonic TFBSs in untranslated regions (UTRs) tend to avoid positions where known UTR‐related functions are located. Moreover, active exonic TFBSs tend to be in close physical proximity to distal promoters whose genes have elevated transcription levels. These results suggest that exonic TFBSs might be more prevalent than originally thought and are likely in dual use. We propose a parsimonious model that explains the observed evolutionary behaviors of exonic TFBSs as well as how a stretch of codons evolves into a TFBS.

Key points

There are more exonic regulatory sequences in the human genome than originally thought.
Exonic transcription factor binding sites are more likely to be under negative or positive selection than their nonregulatory counterparts.
Exonic transcription factor binding sites tend to be located in genome sequences that encode less critical loops in protein structures, or in less critical parts of 5′ and 3′ untranslated regions.