Abstract

The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterizing this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset, which measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

Highlights

  • Changes in gene expression play a crucial role in a wide variety of cellular processes

  • Predictive models trained on Massively Parallel Reporter Assay (MPRA) data are more likely to be sensitive to functional regulatory patterns that affect gene expression

  • The increasing size and design complexity of MPRAs in the literature motivated us to develop MPRA-DragoNN, a convolutional neural network (CNN)-based predictive model for learning de novo regulatory patterns from noncoding DNA sequences based on their MPRA activity

Summary

Introduction

Changes in gene expression play a crucial role in a wide variety of cellular processes. Functional genomic assays developed over the last decade (such as ChIP-seq, DNase/ATAC-seq, and others) have allowed candidate cis-regulatory elements (cCREs) to be mapped on a genome-wide scale in a wide variety of cell lines and tissues 2,3. They have more recently been supplemented by massively parallel quantitative measurements of the regulatory activity of native cCREs and synthetic constructs in the form of Massively Parallel Reporter Assays (MPRAs) 4–8 and Self-Transcribing Active Regulatory Regions sequencing (STARR-seq) 9–13, as well as direct high-throughput perturbations of cCREs in their native contexts using pooled CRISPR screens 14,15.
