BackgroundRepetitive DNA sequences (Repeats) are significant regions in the human genome that have a specific genomic distribution, structure, and several binding sites for genome architecture and function. In consequence, the possible configurations of Repeats in specific and dynamic regions like the gene promoters could define footprints for molecular mechanisms, pathways, and cell function beyond their density in the genome. Here we explored the distribution of Repeats in the upstream promoter region of the human coding genes with the aim to identify specific configurations, clusters and functional meaning of those elements. Our method includes structural descriptions, hierarchical clustering, pathway association, and functional enrichment analysis.ResultsWe report here several configurations of Repeats in the upstream promoter region (UPR), which define 2729 patterns for the 80% of the human coding genes. There are 47 types of Repeats in these configurations, where the most frequent were Alu, Low_complexity, MIR, Simple_repeat, LINE/L2, LINE/L1, hAT-Charlie, and ERV1. The distribution, length, and the high frequency of Repeats in the UPR defines several patterns and clusters, where the minimum frequency of configuration among Repeats was higher than 0.7. We found those clusters associated with cellular pathways and ontologies; thus, it was plausible to determine groups of Repeats to specific functional insights, for example, pathways for Genetic Information Processing or Metabolism shows particular groups of Repeats with specific configurations.ConclusionBased on these findings, we propose that specific configurations of repetitive elements describe frequent patterns in the upstream promoter for sets of human coding genes, which those correlated to specific and essential cell pathways and functions.
Read full abstract