High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions.

Phaedra Agius, Christina Leslie, Aaron Arvey, William Stafford Noble, William Chang, Uwe Ohler

doi:10.1371/journal.pcbi.1000916

Abstract

Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments.
Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.

Highlights

Gene regulatory programs are orchestrated by transcription factors (TFs), proteins that coordinate expression of target genes both through direct interaction with DNA and with non-DNA-binding accessory proteins.
We found that the ChIP-derived support vector machine (SVM) models significantly improve TF occupancy prediction in mammalian genomes when compared to protein binding microarray (PBM)-derived support vector regression (SVR) models.
To compare pairs of probe sequences for SVR training, we developed a novel string kernel called the di-mismatch kernel, which is a k-mer based string kernel adapted to the problem of TF binding models.
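
To make the idea of a k-mer based string kernel concrete, the following is a minimal sketch of a generic mismatch-style k-mer kernel in Python. It is not the di-mismatch kernel described in the paper, whose exact definition is not given in this section; it only illustrates how two probe sequences can be compared through shared k-mers while tolerating a few mismatches. The function names, the default values k=3 and m=1, and the brute-force enumeration of all candidate k-mers are assumptions chosen for readability, not the authors' implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmers(seq, k):
    """Return every overlapping window of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_features(seq, k, m):
    """Map each possible k-mer to the number of windows of seq that lie
    within m mismatches of it (brute force, so only suitable for small k)."""
    feats = Counter()
    for window in kmers(seq, k):
        for letters in product(ALPHABET, repeat=k):
            candidate = "".join(letters)
            if hamming(window, candidate) <= m:
                feats[candidate] += 1
    return feats


def kernel(a, b, k=3, m=1):
    """Inner product of the mismatch feature vectors of a and b."""
    fa = mismatch_features(a, k, m)
    fb = mismatch_features(b, k, m)
    return sum(count * fb[kmer] for kmer, count in fa.items() if kmer in fb)


def normalized_kernel(a, b, k=3, m=1):
    """Kernel rescaled so that every sequence has similarity 1 with itself."""
    return kernel(a, b, k, m) / sqrt(kernel(a, a, k, m) * kernel(b, b, k, m))
```

With m=0 this reduces to a plain spectrum kernel that counts exactly shared k-mers. A kernel matrix built this way over a set of probes could, in principle, be passed to any kernel regression method that accepts precomputed kernels; that step is omitted here.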

Summary

Introduction

Gene regulatory programs are orchestrated by transcription factors (TFs), proteins that coordinate expression of target genes both through direct interaction with DNA and with non-DNA-binding accessory proteins (cofactors). Modeling the DNA sequence preferences of these TFs, and using these sequence preferences in an appropriate way to predict whether the TF can bind a genomic site in vivo, are key pieces in unraveling the regulatory code. For many years, these efforts have been frustrated by the limited availability and quality of TF binding site motifs, usually represented as a position-specific scoring matrix (PSSM) or a consensus sequence. These motifs may match thousands of sites in intergenic regions, producing an unreliable list of potential TF target genes. Since it is not feasible to collect occupancy data for all TFs and all possible cellular contexts, we must develop better methods for predicting in vivo occupancy, which will depend in part on improving our models of TF binding preferences.

Objectives

Methods

Results

Conclusion