Abstract

Most position weight matrix (PWM) based bioinformatics methods for predicting transcription factor binding sites (TFBSs) assume that each nucleotide in the sequence motif contributes independently to the interaction between the protein and the DNA sequence, which usually produces many false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependence structure and the accurate prediction of TFBSs. We develop a novel tree-based PWM (TPWM) approach to accurately model the interaction between a TF and its binding site. The whole tree-structured PWM can be considered a mixture of different conditional PWMs. We propose a discriminative approach, called TPD (TPWM-based Discriminative Approach), to construct the TPWM from ChIP-Seq data, starting from a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated for accuracy on extensive synthetic datasets. We then apply the TPD approach to several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both simulated and real ChIP-Seq data show that the proposed method, starting from an existing PWM, consistently performs better than existing tools in detecting TFBSs. The improved accuracy results from modelling the complete dependence structure of the motifs and from better prediction of true positives. The findings could lead to a better understanding of the mechanisms of TF-DNA interactions.
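The MCC-based cutoff mentioned above can be illustrated with a minimal sketch. It assumes that every candidate site already has a numeric score and that the cutoff is chosen by scanning the observed scores; the function names `mcc` and `best_cutoff` and the example scores are hypothetical, and the paper's actual procedure may differ.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_cutoff(pos_scores, neg_scores):
    """Return (cutoff, mcc) for the observed score that maximises MCC.

    A site is predicted as bound when its score is >= cutoff.
    Assumption: candidate cutoffs are the observed scores themselves.
    """
    best_value, best_cut = -2.0, None
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cutoff for s in neg_scores)
        tn = len(neg_scores) - fp
        value = mcc(tp, fp, tn, fn)
        if value > best_value:
            best_value, best_cut = value, cutoff
    return best_cut, best_value


# Hypothetical scores for positive (bound) and negative (background) sequences.
cutoff, value = best_cutoff([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```

Because MCC uses all four confusion-matrix counts, it stays informative when the positive and negative sets differ in size, which is one reason it is a common choice for this kind of threshold selection.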

Highlights

  • Transcription factors are a group of proteins that participate in gene regulation by binding to specific short DNA sequences, known as transcription factor binding sites (TFBSs)

  • Accurate identification of TFBSs is the first and perhaps the most critical step in modeling gene regulatory mechanisms from datasets generated by recent high-throughput approaches, such as ChIP-Seq/chip [1]

  • Tree-based PWM (TPWM) is inspired by maximal dependence decomposition (MDD) but has been augmented by a number of critical modifications that make it suitable for modelling TFBS


Summary

Introduction

Transcription factors are a group of proteins that participate in gene regulation by binding to specific short DNA sequences, known as transcription factor binding sites (TFBSs). Accurate identification of TFBSs is the first and perhaps the most critical step in modeling gene regulatory mechanisms from datasets generated by recent high-throughput approaches, such as ChIP-Seq/chip [1]. TFBSs are usually short and degenerate at multiple positions. Although numerous computational approaches to predicting TFBSs have been proposed in recent years, the high false positive rate is still a problem. Depending on how the TFBSs are represented, the computational prediction methods fall into three broad classes: PWM-based approaches [7,8], consensus-sequence-based or regular-expression-based approaches [9,10,11], and feature-based methods [12,13].
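To make the PWM-based class concrete, the following minimal sketch scores candidate sites with a standard log-odds PWM, in which each position contributes independently to the total score. The count matrix, the uniform background, the pseudocount, and the function names are all hypothetical choices for illustration; this is not the TPWM model proposed in the paper.

```python
import math

# Hypothetical counts for a 4-bp motif; one dict per motif position.
COUNTS = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 1, "C": 7, "G": 1, "T": 1},
    {"A": 0, "C": 1, "G": 9, "T": 0},
    {"A": 1, "C": 1, "G": 0, "T": 8},
]
BACKGROUND = 0.25   # assumed uniform background frequency
PSEUDOCOUNT = 0.5   # assumed pseudocount to avoid log(0)


def log_odds_matrix(counts, pseudocount=PSEUDOCOUNT):
    """Convert per-position counts into log2 odds against the background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / BACKGROUND)
            for base in "ACGT"
        })
    return matrix


def score(matrix, site):
    """Sum the per-position scores: the independence assumption in action."""
    return sum(column[base] for column, base in zip(matrix, site))


def best_hit(matrix, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(matrix)
    windows = range(len(sequence) - width + 1)
    return max((score(matrix, sequence[i:i + width]), i) for i in windows)


pwm = log_odds_matrix(COUNTS)
top_score, offset = best_hit(pwm, "TTTACGTTT")
```

Because the score is a plain sum over positions, a model of this form cannot express that the preferred base at one position depends on the base at another, which is the limitation that dependence-aware models aim to address.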

