Abstract

Article Figures and data Abstract Editor's evaluation Introduction Results Discussion Materials and methods Appendix 1 Data availability References Decision letter Author response Article and author information Metrics Abstract Genetic and environmental exposures cause variability in gene expression. Although most genes are affected in a population, their effect sizes vary greatly, indicating the existence of regulatory mechanisms that could amplify or attenuate expression variability. Here, we investigate the relationship between the sequence and transcription start site architectures of promoters and their expression variability across human individuals. We find that expression variability can be largely explained by a promoter’s DNA sequence and its binding sites for specific transcription factors. We show that promoter expression variability reflects the biological process of a gene, demonstrating a selective trade-off between stability for metabolic genes and plasticity for responsive genes and those involved in signaling. Promoters with a rigid transcription start site architecture are more prone to have variable expression and to be associated with genetic variants with large effect sizes, while a flexible usage of transcription start sites within a promoter attenuates expression variability and limits genotypic effects. Our work provides insights into the variable nature of responsive genes and reveals a novel mechanism for supplying transcriptional and mutational robustness to essential genes through multiple transcription start site regions within a promoter. Editor's evaluation This paper presents valuable findings about how human genetic variation impacts gene expression. Using a compelling analysis of new experimental data based on cell lines from 108 individuals, the authors uncover features that distinguish promoters with highly variable expression across individuals from those exhibiting low variability. This work and the associated resource will be of broad interest for further investigations of the interplay between genetic variation and gene expression control. https://doi.org/10.7554/eLife.80943.sa0 Decision letter Reviews on Sciety eLife's review process Introduction Transcriptional regulation is the main process controlling how genome-encoded information is translated into phenotypes. Hence, understanding how transcriptional regulation influences gene expression variability is of fundamental importance to understand how organisms are capable of generating proper phenotypes in the face of stochastic, environmental, and genetic variation. Through differentiation, cells acquire highly specialized functions, but need to still maintain their general abilities to accurately regulate both essential pathways as well as responses to changes in the environment. To achieve robustness, regulatory processes must be capable of attenuating expression variability of essential genes (Bartha et al., 2018), while still allowing, or possibly amplifying (Eldar and Elowitz, 2010; Urban and Johnston, 2018), variability in expression for genes that are required for differentiation or responses to environmental changes and external cues. How cells can achieve such precision and robustness remains elusive. Genetic variation affects the expression level (Montgomery et al., 2010; Pickrell et al., 2010; Stranger et al., 2007) of the majority of human genes (Battle et al., 2017; Lappalainen et al., 2013; Storey et al., 2007). However, genes are associated with highly different effect sizes, with ubiquitously expressed or essential genes frequently being less affected (Battle et al., 2017). This indicates that genes associated with different regulatory programs are connected with different mechanisms or effects of mutational robustness (Payne and Wagner, 2015). Multiple transcription factor (TF) binding sites may act to buffer the effects of mutations in promoters (Spivakov et al., 2012), and promoters can have highly flexible transcription start site (TSS) architectures (Akalin et al., 2009; Carninci et al., 2006; Lehner, 2008). This demonstrates that the sequence and architecture of a promoter may influence its variability in expression across individuals. Previous studies aimed at identifying processes involved in the regulation of gene expression variability have indeed revealed regulatory features mostly associated with the promoters of genes, such as CpG islands and TATA-boxes (Morgan and Marioni, 2018; Ravarani et al., 2016; Sigalova et al., 2020), the chromatin state around gene TSSs (Faure et al., 2017), and the propensity of RNA polymerase II to pause downstream of the TSS (Boettiger and Levine, 2009). These studies have relied on model organisms or focused on transcriptional noise across single cells. As of yet, regulatory features have not been thoroughly studied from the perspective of variability in promoter activity or across human individuals. Furthermore, it is unclear if regulation of variability mainly acts to attenuate variability to achieve stable expression for certain genes or if independent regulatory processes act in parallel to amplify variability for other genes. Here, we provide a comprehensive characterization of the sequences, TSS architectures, and regulatory processes determining variability of promoter activity across human lymphoblastoid cell lines (LCLs). We find that variability in promoter activity is to a large degree reflected by the promoter sequence, notwithstanding possible genotypic differences. Furthermore, the presence of binding sites for specific TFs, including those of the ETS family, are highly predictive of low promoter variability independently of their impact on promoter expression levels. In addition, we demonstrate that differences in the variability of promoters reflect their involvement in distinct biological processes, indicating a selective trade-off between stability and plasticity of promoters. Finally, we show that flexibility in TSS usage is associated with attenuated promoter variability. Our results reveal a novel mechanism that confers mutational robustness to gene promoters via switches between proximal core promoters. This study provides fundamental insights into transcriptional regulation, indicating shared mechanisms that can buffer stochastic, environmental, and genetic variation and how these affect the responsiveness and cell-type restricted activity of genes. Results TSS profiling reveals variability in promoter activity across individuals To characterize human variability in promoter activities, we profiled TSSs using CAGE (Cap Analysis of Gene Expression Takahashi et al., 2012; Figure 1A) across 108 Epstein-Barr virus (EBV)-transformed LCLs (Auton et al., 2015) of African origin, 89 from Yoruba in Ibadan, Nigeria (YRI) and 19 from Luhya in Webuye, Kenya (LWK) (Supplementary file 1). The samples had a balanced sex ratio, 56 females and 52 males, and no observable population stratification in the expression data (Figure 1—figure supplement 1). With CAGE, TSSs can be mapped with single base pair resolution and the relative number of sequencing reads supporting each TSS gives simultaneously an accurate estimate of the abundance of its associated RNA (Kawaji et al., 2014). The CAGE data across the LCL panel therefore give us a unique opportunity to both estimate variability in promoter activity and characterize the regulatory features influencing such variability. Figure 1 with 1 supplement see all Download asset Open asset CAGE profiling of TSSs reveals diverse promoter variability across individuals. (A) Illustration of the experimental design and approach for measuring promoter activity and variability. Capped 5’ ends of RNAs from LCLs derived from 108 individuals were sequenced with CAGE, followed by individual-agnostic positional clustering of proximal CAGE-inferred TSSs (first 5’ end bp of CAGE reads). The expression level of the resulting CAGE-inferred promoters proximal to annotated gene TSSs were quantified in each individual and used to measure promoter variability. (B) Example of promoter activity (TPM normalized count of CAGE reads) across individuals for a low variable promoter (gene RPL26L1) and a highly variable promoter (gene SIX3) with similar average expression across the panel. (C–D) Genome tracks for two promoters showing average TPM-normalized CAGE data (expression of CAGE-inferred TSSs) across individuals (top track) and TPM-normalized CAGE data for three individuals (bottom tracks) for a low variable promoter (panel C, gene RPL26L1) and a highly variable promoter (panel D, gene SIX3). (E–F) The CV2 (squared coefficient of variation) and mean expression relationship of 29,001 CAGE-inferred promoters across 108 individuals before (E) and after (F) adjustment of the mean expression-dispersion relationship. The CV2 and mean expression are log10 transformed, orange lines show loess regression lines fitting the dispersion to the mean expression level, and example gene promoters from B-D are highlighted in colors. We identified 29,001 active promoters of 15,994 annotated genes (Frankish et al., 2019) through positional clustering of proximal CAGE-inferred TSSs on the same strand (Figure 1A; Carninci et al., 2006) detected in at least 10% of individuals (Supplementary file 2). This individual-agnostic strategy ensured a focus on promoters that are active across multiple individuals while also allowing for the measurement of variability in promoter activity across the panel. For example, the CAGE data revealed that the promoters of gene RPL26L1, encoding a putative component of the large 60 S subunit of the ribosome, and transcription factor gene SIX3 have highly different variance yet similar mean expression across individuals (Figure 1B–D). We used the squared coefficient of variation (CV2) as a measure of promoter expression dispersion, revealing how the normalized expression across individuals deviates from the mean for each identified promoter. We observed that the promoter CV2 decreases by increasing mean expression (Figure 1E; Eling et al., 2018; Kolodziejczyk et al., 2015; Sigalova et al., 2020). To account for this bias, we subtracted the expected dispersion for each promoter according to its expression level (Kolodziejczyk et al., 2015; Newman et al., 2006). Importantly, rank differences in promoter dispersion were maintained for each expression level after adjustment, as seen for promoters of genes RPL26L1 and SIX3 (Figure 1E and F). This strategy thus allowed us to investigate how promoter architecture and sequence determine variability in promoter activity across the panel separately from its impact on expression level (Figure 1F). Promoter expression variability is reflected by the promoter sequence To investigate if local sequence features at promoters reflect their variability in activity, we applied machine learning (convolutional neural network (CNN); Figure 2—figure supplement 1A; see Materials and methods) to discern low variable promoters (N=5,054) from highly variable promoters (N=5,683) based on their local DNA sequence alone. We considered the genomic reference sequence to model the intrinsic component of variability encoded within the promoter sequence independently of local genetic variants within the panel. The resulting model was capable of distinguishing between these promoter classes with high accuracy (area under receiving operating curve (AUC)=0.84 for the out of sample test set; Figure 2—figure supplement 1B), equally well for highl and low variable promoters (per-class test set F1 scores of 0.76 and 0.77, respectively). To assess which sequence features the CNN had learned to distinguish the classes, we calculated importance scores using DeepLift (Shrikumar et al., 2019) for each nucleotide in the input sequences for predicting low and high promoter variability. This approach relies on backpropagation of the contributions of all neurons in the CNN to the input features, nucleotides, and can therefore be used to identify properties or short stretches of DNA indicative of amplifying or attenuating expression variability. We applied motif discovery on clustered stretches, so called metaclusters, of the input sequences with high importance scores (Shrikumar, 2020) and matched the identified metaclusters to known TF binding motifs (Fornes et al., 2020). This strategy revealed TFs indicative of either high or low promoter variability (Figure 2A–C). Noteworthy, we observed motifs for the ETS superfamily of TFs, including ELK1, ETV6, and ELK3, associated with low variable promoters, and motifs for PTF1A, ASCL2, and FOS-JUN heterodimer (AP-1) among highly variable promoters. These results demonstrate that the promoter sequence and its putative TF binding sites are predictive of the expression variability of a promoter. Figure 2 with 5 supplements see all Download asset Open asset Promoter sequence features are highly predictive of promoter variability. (A) Sequence logo of a metacluster (top) identified for low variable promoter sequences that matches known TF motifs (bottom) for ETS factors ELK1, ETV6, and ELK3. (B–C) Sequence logos of two metaclusters (top) identified for highly variable promoter sequences that match known TF motifs (bottom) for PTF1A and ASCL2 (B) and FOSL2-JUND and FOS-JUN heterodimers (C). (D) Average contribution (SHAP values) of CpG content and each of the 124 TFs identified as important for predicting promoter variability. Features are ordered by their average contribution to the prediction of highly variable promoters and selected TFs are highlighted. For a full version of the plot see Figure 2—figure supplement 3A. (E) The frequency of predicted TF binding sites (presence/absence) in highly variable (green) and low variable (blue) promoters. TFs are ordered as in D. For a full version of the plot see Figure 2—figure supplement 3B and C. (F–G) Promoters split into groups based on the presence/absence of high CpG content (F), and predicted binding sites of ELK3 (G). For both features displayed in panels F and G, the left subpanels display the relationship between log10-transformed mean expression levels and adjusted log10-transformed CV2 with loess regression lines shown separately for each promoter group. The right subpanels display box-and-whisker plots of the differences in adjusted log10-transformed CV2 between the two promoter groups (central band: median; boundaries: first and third quartiles; whiskers:+/-1.5 IQR). p-values were determined using the Wilcoxon rank-sum test. Sequence features of promoters are highly predictive of promoter variability To systematically test how predictive TF binding sites are of the variability of active promoters, we made use of binding sites predicted from motif scanning for 746 TFs (Fornes et al., 2020). TF binding site profiles and low/high CpG content (Figure 2—figure supplement 2A) were collected for each identified promoter and the resulting feature data were used to train a machine learning (random forest) classifier features associated with either high or low variability (low variable N=5054, highly variable N=5683). Feature selection (Kursa and Rudnicki, 2010) identified 124 of the 746 TFs as well as CpG ratio to be important for classification, and a classifier based on these selected features demonstrated high predictive performance (AUC = 0.79; per-class F1 score of 0.73 and 0.68 for low and highly variable promoters, respectively; Figure 2—figure supplement 2B), reinforcing the strong link observed between DNA sequence and promoter variability (Figure 2—figure supplement 1B). Reverse engineering of the random forest classifier (SHAP, Shapley additive explanations) (Lundberg and Lee, 2017) allowed us to calculate the marginal contribution of each of the 125 selected features to the prediction of variability class for each promoter and whether the feature on average is indicative of amplifying or attenuating variability of expression when present in the promoter sequence (Figure 2D; Figure 2—figure supplement 3A). We identified the presence of high observed/expected CpG ratio and TATA-binding protein (TBP) binding sites (TATA-boxes) to be the strongest predictive features of low and high promoter variability, respectively. Although the remaining TFs contribute only marginally on their own to the predictions compared to TATA-box and high CpG content status (Figure 2D; Figure 2—figure supplement 3A), a baseline model (decision tree) based on CpG ratio and TBP binding site presence alone yielded worse performance than the full model (AUC = 0.71 vs 0.79 for the baseline model and the full model, respectively; Figure 2—figure supplement 2B). This demonstrates that the TF binding grammar contributes to a promoter’s expression variability. TFs associated with highly variable promoters are mostly related to tissue-specific or developmental regulation (e.g., FOXP2, HOXA10) while TFs predictive of low promoter variability are generally associated with ubiquitous activity across cell types and a diverse range of basic cellular processes (e.g. ELK1, ELF4, ETV3). In addition, TFs predictive of high variability (e.g. ZIC2, ZNF449, HOXA10) tend to have binding sites in relatively few highly variable promoters while TFs predictive of low promoter variability (e.g. ELK1, ELK3) show a propensity for having binding sites present in a large number of promoters (Figure 2E; Figure 2—figure supplement 3B and C). This suggests that variably expressed promoters have diverse TF binding profiles and that the regulatory grammar for promoter stability is less complex. Although the adjusted dispersion of promoters was separated from their expression level (Figure 1E), we observed that the presence of binding sites for some TFs that are predictive of promoter variability are also associated with promoter expression level (Figure 2—figure supplement 4). Importantly, despite this association, the effect of identified features on promoter variability is still evident across a range of promoter expression levels (Figure 2F and G). This is particularly apparent for CpG islands, which seem to have an attenuating effect on promoter variability regardless of expression level (Figure 2F). Many of the TFs identified as being predictive of low variability (e.g. ELK1, ELK3, ELF4, ETV2, ETV3) belong to the ETS family of TFs (Figure 2D; Figure 2—figure supplement 3A), a large group of TFs that are conserved across Metazoa and characterized by their shared ETS domain that binds 5′-GGA(A/T)–3′ DNA sequences (Sharrocks, 2001). ETS factors are important regulators of promoter activities in lymphoid cells (Hepkema et al., 2020), but are generally involved in a wide range of crucial cellular processes such as growth, proliferation, apoptosis, and cellular homeostasis (Kar and Gutierrez-Hartmann, 2013; Oikawa and Yamada, 2003; Suico et al., 2017). Furthermore, different ETS factors can bind in a redundant manner to the same promoters of housekeeping genes (Hollenhorst et al., 2011; Hollenhorst et al., 2007). However, the shared DNA-binding domain of ETS factors makes it hard to discern individual factors based on their binding motifs alone (Figure 2A). Although in general linked with higher promoter activity (Curina et al., 2017), ETS binding site presence is associated with lower variability across all expression levels (Figure 2G). In addition, the degree of promoter variability decreases by an increasing number of non-overlapping ETS binding sites (Figure 2—figure supplement 5A), regardless of promoter expression level (Figure 2—figure supplement 5B), suggesting that multiple ETS binding sites can either facilitate cooperativity between ETS factors or provide robustness to stabilize promoter variability across individuals. Taken together, our results indicate that promoter sequence can influence both low and high promoter variability across human individuals independently from its impact on expression level. Our results further indicate that variable promoters exhibit highly diverse binding grammars for TFs that are associated with relatively few promoters, while a more uniform regulatory grammar is indicated for stable promoters, being highly associated with higher CpG content and ETS binding sites. Variability in promoter activity reflects plasticity and robustness for distinct biological functions The high performance of predicting promoter variability from local DNA sequence and the distinct TF binding profiles of low and highly variable promoters imply that attenuation and amplification of variability are driven by distinct regulatory mechanisms. This argues that favoring robustness (low variability) over plasticity (high variability) should reflect the biological processes where this provides regulatory advantages. Supporting this hypothesis, we observed that low variable promoters were highly enriched with basic cellular housekeeping processes, in particular metabolic processes (Figure 3A). In contrast, highly variable promoters were enriched with more dynamic biological functions, including signaling, response to stimulus, and developmental processes. Figure 3 with 1 supplement see all Download asset Open asset Levels of promoter variability are reflective of distinct biological processes and a selective trade-off between robustness and plasticity. (A) GO term enrichment, for biological processes, of genes split by associated promoter variability quartiles (Q1, Q2, Q3). Top 10 GO terms of all groups are displayed and ranked based on p-values of the >Q3 variability group. (B) Median promoter variability (line) and interquartile range (shading), as a function of the number of FANTOM cell facets (grouping of FANTOM CAGE libraries associated with the same Cell Ontology term) that the associated gene is expressed in (mean facet expression >5 TPM). (C) The number of differentially expressed promoters, split by variability quartiles, after 6 h TNFα treatment. Promoters are separated into down-regulated (blue) and up-regulated (red). p-values were calculated using Fisher’s exact test. (D) Absolute log2 fold change of differentially expressed promoters, split by variability quartiles, after 6 h of TNFα treatment. (E) Distribution of promoter variability associated with drug-targets (purple), essential (orange), or GWAS hits (green) genes, compared to all promoters (black). Left: density plot of promoter variability per gene category. Right: Box-and-whisker plots of promoter variability split by each category of genes. p-values were determined using the Wilcoxon rank-sum test. For all box-and-whisker plots, central band: median; boundaries: first and third quartiles; whiskers:+/-1.5 IQR. Interestingly, the same features found to be predictive of low and high promoter variability across individuals, including CpG-content and TATA-boxes (TBP binding sites), are also associated with low and high transcriptional noise across individual cells (Faure et al., 2017; Morgan and Marioni, 2018). The presence of a TATA-box is also associated with high gene expression variability in flies (Sigalova et al., 2020). This suggests that some of the same underlying regulatory mechanisms that dictate low or high transcriptional noise across single cells are maintained and conserved between humans and flies at an individual level and manifested to control low and high expression variability across a population, respectively, as well as housekeeping or restricted activity across cell types. In agreement, genes of highly variable promoters tend to have higher transcriptional noise than those of low variable ones across GM12878 single cells (Cohen’s d=0.411, p<2.2 × 10–16, two sample t-test; Figure 3—figure supplement 1A; Supplementary file 3). Furthermore, we observed an inverse correlation between variability in promoter activity and the number of cell types (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014) and tissues (Battle et al., 2017) the corresponding gene is expressed in (Spearman’s rank correlation ρ = −0.21 and −0.15 for cell types and tissues, respectively, p< 2.2 × 10–16; Figure 3B; Figure 3—figure supplement 1B), demonstrating that highly variable promoters are more cell-type and tissue specific in their expression. The restricted expression (Figure 3—figure supplement 1B), biological processes (Figure 3A), and promoter TF grammar (Figure 2D and E) of genes associated with highly variable promoters led us to hypothesize that these are more prone to respond to external stimuli. Tumor necrosis factor (TNFα) induces an acute and time-limited gene response to NFkB signaling (Nelson et al., 2004; Turner et al., 2010), with negligible impact on chromatin topology (Jin et al., 2013), and is therefore suitable to study gene responsiveness. We profiled GM12878 TSSs and promoter activities with CAGE before and after 6 hr treatment with TNFα (Supplementary file 4). This revealed enrichment of up-regulated promoters among highly variable promoters (odds ratio (OR)=1.529, p=4.563 × 10–8, Fisher’s exact test) as posited, while low variable promoters were mostly unaffected or down-regulated (OR = 0.459, p<2.2 × 10–16, Fisher’s exact test; Figure 3C). In addition, low variable promoters had a weaker response (Figure 3D). Furthermore, we observed drug-target genes and genes with GWAS hits to be regulated by highly variable promoters but essential genes to be regulated by low variable promoters (Figure 3E). In contrast, when we compared promoter expression between these same groups of genes we observed no association with drug-targets or GWAS-associated genes. Although essential genes are associated with higher promoter expression, this association is comparably weaker than that with promoter variability (Figure 3—figure supplement 1C). Taken together, our results demonstrate the importance of low promoter variability for cell viability and survival in different conditions and reveal the responsiveness of highly variable promoters. They further indicate that the variability observed in promoter activity across individuals is strongly associated with the regulation of its associated gene, the expression breadth across cell types, and to some extent also the transcriptional noise across single cells, which reflects a selective trade-off between high stability and high responsiveness and specificity. Promoters with low variability have flexible transcription initiation architectures Promoters are associated with different levels of spread of their TSS locations, which has led to their classification into broad or narrow (sharp) promoters according to their positional width (Akalin et al., 2009; Carninci et al., 2006; Lehner, 2008). Although the shape and distinct biological mechanisms of these promoter classes, for example, housekeeping activities of broad promoters, are conserved across species (Carninci et al., 2006; Hoskins et al., 2011), the necessity for positional dispersion of TSSs and its association with promoter variability are poorly understood. Analysis of promoter widths revealed only a weak relationship with promoter variability. We observed an enrichment of highly variable promoters within narrow promoters having an interquartile range (IQR) of their CAGE-inferred TSSs within a width of 1–5 bp (p<2.2 × 10–16, OR = 2.04, Fisher’s exact test). Low variable promoters, on the other hand, were enriched among those of size 10–25 bp (p<2.2 × 10–16, OR = 1.44, Fisher’s exact test), but beyond this width the association is lost (Figure 4—figure supplement 1A). To simultaneously capture the spread of TSSs and their relative frequencies compared to total RNA expression within a promoter, we considered a width-normalized Shannon entropy as a measure of TSS positional dispersion (Hoskins et al., 2011). This measure will discern promoters whose relative TSS expression is concentrated to a small subset of their widths (low entropy) from those with a more even spread (high entropy). We observed that low variable promoters are associated with a higher entropy than promoters with high variability (Figure 4A). Consistently, low variable promoters tend to have more TSSs substantially contributing to their overall expression across individuals (Figure 4—figure supplement 1B). We reasoned that a weaker association between low promoter variability and broad width than with high entropy may be due to low variable promoters being composed of multiple clusters of TSSs (multi-modal peaks) from independent core promoters. Indeed, decomposition of multi-modal peaks within the CAGE TSS signals of promoters (Supplementary file 5) demonstrated that higher entropy reflects an increased number of decomposed promoters, as indicated by their number of local maxima of CAGE signals (Figure 4B). Figure 4 with 2 supplements see all Download asset Open asset Low variable promoters exhibit flexibility in transcription initiation architecture. (A) Promoter shape entropy for promoters split by variability quartiles, displayed as densities (left) and in a box-and-whisker plot (right). (B) Illustration of the local maxima decomposition approach (left; see Materials and methods) and box-and-whisker plot displaying the relationship between the Shannon entropy and the number of local maxima-inferred decomposed promoters. (C–D) Examples of two promoters each containing two decomposed promoters exhibiting low correlation across individuals (panel C, gene UFSP2) and high correlation across individuals (panel D, gene RIT1). Both panels display genome tracks of average, TPM-normalized CAGE-inferred TSS expression levels across the panel (Pooled, top track) and for three individuals (GM18908, GM19152, GM18504, lower tracks). Below the genome tracks, the original p

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call