Abstract
The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset, which measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within the range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We also explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
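The agreement between predicted and measured activity is reported above as a Spearman rank correlation. As a minimal sketch of how such a statistic can be computed, the Python below ranks both vectors (ties share the mean rank) and takes the Pearson correlation of the ranks. The function names `average_ranks` and `spearman_rho` are hypothetical, chosen for illustration; they are not taken from MPRA-DragoNN, and the toy numbers are not data from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx = average_ranks(predicted)
    ry = average_ranks(measured)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy values for illustration only (not from the Sharpr-MPRA dataset).
    toy_predicted = [0.1, 0.4, 0.35, 0.8]
    toy_measured = [0.0, 0.5, 0.9, 0.7]
    print(spearman_rho(toy_predicted, toy_measured))
```

In practice, `scipy.stats.spearmanr` computes the same statistic; the hand-written version is shown only to make the ranking step explicit.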
Highlights
Changes in gene expression play a crucial role in a wide variety of cellular processes
Predictive models trained on Massively Parallel Reporter Assays (MPRAs) are likely to be sensitive to functional regulatory patterns that affect gene expression
The increasing size and design complexity of MPRAs in the literature motivated us to develop MPRA-DragoNN, a convolutional neural network (CNN)-based predictive model for learning de novo regulatory patterns from noncoding DNA sequences based on their MPRA activity
Summary
Changes in gene expression play a crucial role in a wide variety of cellular processes. Functional genomic assays developed over the last decade (such as ChIP-seq, DNase/ATAC-seq, and others) have allowed candidate cis-regulatory elements (cCREs) to be mapped on a genome-wide scale in a wide variety of cell lines and tissues 2,3. They have more recently been supplemented by massively parallel quantitative measurements of the regulatory activity of native cCREs and synthetic constructs, in the form of Massively Parallel Reporter Assays (MPRAs) 4–8 and Self-Transcribing Active Regulatory Regions sequencing (STARR-seq) 9–13, as well as by direct high-throughput perturbations of cCREs in their native contexts using pooled CRISPR screens 14,15.