Abstract

A position weight matrix (PWM) is a widely accepted probabilistic representation of protein-DNA binding specificity. Previous studies showed that for factors that bind to divergent binding sites, mixtures of multiple PWMs improve performance. We propose a consensus scaffolded mixture PWM (CSM) model to improve the modeling of cis-regulatory elements. The model allows overlapping components represented by a set of PWMs, each of which corresponds to a binding pattern and is scaffolded by a degenerate consensus. In addition, we propose a learning algorithm with two stages: an initial structure learning stage based on frequent pattern mining, and a refining stage based on the expectation maximization (EM) algorithm. We assess the merits of CSM using three independent criteria. In a case study of the transcription factor Leu3, the derived CSM models agree with conventional mixtures but show a better fit according to the Fermi-Dirac distribution. Analysis of the human-mouse conservation of predicted binding sites of 83 JASPAR transcription factors (TFs) shows that the CSM is as good as or better than the simple mixture, the context-specific independent (CSI) mixture, and the single PWM model for 83%, 84%, and 75% of the cases, respectively. Five-fold cross validation on 46 TRANSFAC datasets shows that the CSM model generalizes better than other mixture models.
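
As a rough illustration of the mixture idea that the CSM model builds on, the following minimal sketch scores a site under a simple mixture of PWMs and computes the EM E-step responsibilities. It is not the CSM model or its learning algorithm: the consensus scaffold and the frequent pattern mining stage are omitted. The function names, the pseudocount, and the toy sites are all assumptions made for illustration only.

```python
BASES = "ACGT"

def pwm_from_sites(sites, pseudocount=1.0):
    """Estimate a PWM from aligned, equal-length sites (hypothetical helper).

    Returns one dict per position mapping each base to its probability.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def site_probability(pwm, site):
    """Probability of a site under one PWM, assuming independent positions."""
    p = 1.0
    for column, base in zip(pwm, site):
        p *= column[base]
    return p

def mixture_probability(pwms, weights, site):
    """Probability of a site under a weighted mixture of PWMs."""
    return sum(w * site_probability(pwm, site) for pwm, w in zip(pwms, weights))

def responsibilities(pwms, weights, site):
    """EM E-step: posterior probability that each component generated the site."""
    joint = [w * site_probability(pwm, site) for pwm, w in zip(pwms, weights)]
    total = sum(joint)
    return [j / total for j in joint]

# Toy, made-up sites for two binding patterns (not taken from any dataset).
toy_pattern_a = ["CCGGTA", "CCGGTA", "CCGGAA"]
toy_pattern_b = ["TTACGG", "TTACGG"]
toy_pwms = [pwm_from_sites(toy_pattern_a), pwm_from_sites(toy_pattern_b)]
toy_weights = [0.5, 0.5]
toy_resp = responsibilities(toy_pwms, toy_weights, "CCGGTA")
```

In a full EM loop, the responsibilities would be used as fractional counts to re-estimate each PWM and the mixture weights, and the two steps would alternate until the likelihood stops improving.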
