Abstract

Background

Computational modeling of transcription factor (TF) sequence specificity is an important research topic in regulatory genomics. A systematic comparison of 26 algorithms for learning TF-DNA binding specificity from in vitro protein-binding microarray (PBM) data was published recently, but the quality of the examined PBMs was not fully evaluated.

Results

Here, new quality-control parameters, such as the principal component analysis (PCA) ellipse, are proposed to assess data quality for either single or paired PBMs. Additionally, a biophysical model of TF-DNA interactions that includes adjacent dinucleotide interdependence was implemented in a new program, BayesPI2, in which sparse Bayesian learning and the relevance vector machine are used to estimate the unknown model parameters. Then, 66 mouse TFs from the DREAM5 challenge were classified into two groups (i.e. good vs. bad) based on the paired PBM quality-control parameters. Subsequently, computational methods for modeling TF sequence specificity were evaluated across the two groups.

Conclusion

The results indicate that both algorithm performance and the predicted TF-binding energy level of a motif are significantly influenced by PBM data quality, and that poor PBM data quality is linked to specific protein domains (e.g. the C2H2 DNA-binding domain). In particular, the new dinucleotide energy-dependent model (BayesPI2) offers a large improvement in testing prediction accuracy over the simple energy-independent model for at least 21% of the analyzed TFs.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-289) contains supplementary material, which is available to authorized users.

Highlights

  • Computational modeling of transcription factor (TF) sequence specificity is an important research topic in regulatory genomics

  • Using the energy-dependent model of BayesPI2 or BEEML-PBM [1], about 14 and 5 TFs, respectively, show great improvement in testing prediction accuracy over the simple energy-independent model; in the same test, no improvement was found with FeatureREDUCE

  • Among the 5 TFs identified by BEEML-PBM, 3 belong to the good-quality PBM group (TF_27, TF_32, and TF_53; Table 1), of which 2 (TF_27 and TF_53) were also identified by BayesPI2; the remaining 2 are in the bad-quality PBM group (TF_21 and TF_60; Additional file 1: Table S1), of which only one (TF_21) was recovered by BayesPI2

Introduction

A comprehensive evaluation of 26 algorithms for modeling transcription factor (TF) sequence specificity in in vitro protein-binding microarray (PBM) data [1] was published by the DREAM5 (Dialogue for Reverse Engineering Assessments and Methods) consortium. According to that evaluation, mononucleotide position weight matrix (PWM) methods perform comparably to more advanced dinucleotide PWM algorithms for modeling TF sequence specificity, and the inferred binding energy level of a motif has little effect on overall prediction accuracy.
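
To make the mononucleotide versus dinucleotide distinction concrete, the following is a minimal Python sketch and not the BayesPI2 implementation. It assumes a toy 3-bp motif, made-up energy values in units of kT, and a logistic (Fermi-Dirac-style) occupancy function as one common biophysical form; the names `mono`, `di`, `mono_energy`, `dinuc_energy`, and `occupancy` are hypothetical.

```python
import math

# Hypothetical per-position energies (units of kT) for a toy 3-bp motif.
# Lower energy means stronger binding. These numbers are illustrative only.
mono = [
    {"A": -1.0, "C": 0.5, "G": 0.5, "T": 0.5},
    {"A": 0.5, "C": -1.0, "G": 0.5, "T": 0.5},
    {"A": 0.5, "C": 0.5, "G": -1.0, "T": 0.5},
]

# Hypothetical corrections for adjacent base pairs at positions (0,1) and (1,2).
# Pairs that are not listed contribute no correction.
di = [{"AC": -0.5}, {}]


def mono_energy(site, mono):
    """Independent (mononucleotide) model: sum one energy term per position."""
    return sum(mono[i][base] for i, base in enumerate(site))


def dinuc_energy(site, mono, di):
    """Dependent model: mononucleotide sum plus adjacent-dinucleotide terms."""
    pair_terms = sum(
        di[i].get(site[i:i + 2], 0.0) for i in range(len(site) - 1)
    )
    return mono_energy(site, mono) + pair_terms


def occupancy(energy, mu=0.0):
    """Logistic binding probability; mu is an assumed chemical potential."""
    return 1.0 / (1.0 + math.exp(energy - mu))


if __name__ == "__main__":
    for site in ("ACG", "ATG"):
        e_mono = mono_energy(site, mono)
        e_di = dinuc_energy(site, mono, di)
        print(site, e_mono, e_di, occupancy(e_mono), occupancy(e_di))
```

In this toy setup the two models differ only for sites containing the listed adjacent pair, which is the kind of interdependence a mononucleotide PWM cannot represent.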
