Abstract

Motivation: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing.

Results: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average.

Availability and implementation: Software implementation is available from https://github.com/jttoivon/moder2.

Supplementary information: Supplementary data are available at Bioinformatics online.
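To make the difference between the two model classes concrete, the following minimal Python sketch scores a sequence under a PPM (independent positions) and under a first-order model in the spirit of an ADM (each nucleotide conditioned on its left neighbour). It is only an illustration and not the MODER2 implementation: the function names, the data layout and all toy probabilities are assumptions made for this example.

```python
import math

def ppm_log_prob(seq, ppm):
    """Log-probability of seq under a PPM: every position is independent.
    ppm[i][b] is the probability of base b at position i."""
    return sum(math.log(ppm[i][base]) for i, base in enumerate(seq))

def adm_log_prob(seq, initial, transitions):
    """Log-probability of seq under a first-order (adjacent dinucleotide) model.
    initial[b] is P(first base = b); transitions[i][a][b] is
    P(base at position i+1 = b | base at position i = a)."""
    logp = math.log(initial[seq[0]])
    for i in range(len(seq) - 1):
        logp += math.log(transitions[i][seq[i]][seq[i + 1]])
    return logp

# Hypothetical toy motif of length 2 in which A and T tend to repeat.
position = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_ppm = [position, position]
toy_initial = position
toy_transitions = [{
    "A": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "C": uniform,
    "G": uniform,
    "T": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
}]

for site in ("AA", "AT"):
    p_ppm = math.exp(ppm_log_prob(site, toy_ppm))
    p_adm = math.exp(adm_log_prob(site, toy_initial, toy_transitions))
    print(site, round(p_ppm, 4), round(p_adm, 4))
```

With these toy numbers the PPM assigns the same probability (0.16) to AA and AT, because it cannot see that the second base depends on the first, whereas the first-order model separates them (0.28 versus 0.04).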

Highlights

  • Transcription factors (TFs) regulate the expression of their target genes by binding to specific DNA sequence segments in the promoter and enhancer areas of the targets

  • The expectation maximization (EM) search is initialized with user-given seed sequences for the monomeric motifs to be learned, and the search is restricted to a user-given range of spacings and orientations of dimers

  • Given a seed CYMRTAAAA and Hamming radii q = 2, 3, ..., 9, and 1, MODER2 accurately relearned the model from this data when total signal fraction was 0.3 or 0.9: the learned parameters differed from the original at most by 0.188 in weighted maximum norm (Supplementary Section S1), and for larger radii, the difference was smaller, radii 7 and 8 giving the smallest differences; see Supplementary Table S2
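As an illustration of what a Hamming radius around an IUPAC seed such as CYMRTAAAA can mean, the sketch below counts the positions at which a DNA sequence falls outside the nucleotide set allowed by the seed and checks whether that count is within a radius q. How MODER2 itself uses the radius is not described in this excerpt, so the matching rule and the function names here are assumptions for illustration only; the IUPAC code table is the standard one.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def seed_mismatches(seq, seed):
    """Number of positions where seq has a base not allowed by the IUPAC seed.
    Assumed definition of Hamming distance to a degenerate seed."""
    if len(seq) != len(seed):
        raise ValueError("sequence and seed must have the same length")
    return sum(base not in IUPAC[code] for base, code in zip(seq, seed))

def within_radius(seq, seed, q):
    """True if seq lies within Hamming radius q of the seed."""
    return seed_mismatches(seq, seed) <= q

seed = "CYMRTAAAA"
for candidate in ("CTAGTAAAA", "GTAGTAAAA", "GGGTTAAAA"):
    print(candidate, seed_mismatches(candidate, seed), within_radius(candidate, seed, 2))
```

Under this definition a larger radius admits more sequences as candidate occurrences of the seed, so the choice of q trades seed specificity against how much of the data can contribute to learning.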

Introduction

Transcription factors (TFs) regulate the expression of their target genes by binding to specific DNA sequence segments (motifs) in the promoter and enhancer areas of the targets. Binding TFs may form clusters of two or more factors, which makes the regulation combinatorial by nature (De Val et al., 2008; Gordan and Siggers, 2013; Jolma et al., 2015; Morgunova and Taipale, 2017; Panne et al., 2007; Rodda et al., 2005). Models that represent dimeric motifs are composed of models for the monomeric motifs involved, plus a description of the structure of the dimer. Such a description represents the preferred relative spacings and orientations of the monomeric components of the dimer as well as models the co-operative effects

