Optimized Position Weight Matrices in Prediction of Novel Putative Binding Sites for Transcription Factors in the Drosophila melanogaster Genome

Vyacheslav Y Morozov, Ilya P Ioshikhes, Dmitry I Nurminsky

doi:10.1371/journal.pone.0068712

Journal: PLoS ONE
Publication Date: Aug 6, 2013
Citations: 1
License type: CC BY 4.0

Affiliation: University of Ottawa

Abstract

Position weight matrices (PWMs) have become a tool of choice for identifying transcription factor binding sites in DNA sequences. DNA-binding proteins often show degeneracy in their binding requirements, so the overall binding specificity of many proteins is unknown and remains an active area of research. Although existing PWMs are more reliable predictors than consensus string matching, they generally produce a high number of false positive hits. Our previous study introduced a promising approach to PWM refinement in which known motifs are used to computationally mine putative binding sites directly from aligned promoter regions using the composition of similar sites. In the present study, we extended this technique, originally tested on single examples of transcription factors (TFs), and showed its capability to optimize PWM performance in predicting new binding sites in the fruit fly genome. We propose refined PWMs in mono- and dinucleotide versions, similarly computed for a large variety of Drosophila melanogaster transcription factors. Along with the addition of many auxiliary sites, the optimization varies the PWM motif length, the location of binding sites on the promoters, and the PWM score threshold. To assess the predictive performance of the refined PWMs, we compared them with conventional TRANSFAC and JASPAR sources. The results were verified by tests and a literature review. Overall, the refined PWMs, which contain putative sites derived from real promoter content and processed using optimized parameters, had better general accuracy than conventional PWMs.

Highlights

Transcription factors (TFs) play a crucial role in gene regulation, usually binding to DNA by recognizing certain motifs on one or both strands of DNA adjacent to the regulated gene.
Results of our computational experiments show that optimized matrices can successfully detect binding sites in a test data set constructed independently of the training data sets.
This demonstrates that our machine learning approach produced position weight matrices (PWMs) with better predictive performance than generic TRANSFAC matrices.

Summary

Introduction

Transcription factors (TFs) play a crucial role in gene regulation, usually binding to DNA by recognizing certain motifs on one or both strands of DNA adjacent to the regulated gene. Predicting TF binding sites (TFBSs) is a major challenge for computational biologists as increasing amounts of sequence data become available. Direct experimental investigations of TF-DNA binding are still rather time-consuming and labour-intensive. Position weight matrices (PWMs) have become an essential computational tool and the model of choice for describing the sequence binding specificities of particular TF-DNA interactions. The relatively short (5–15 nt) sequence motifs are recognized by TFs whose sequence specificities are not very strict [1,2]. The variability in the binding sites of a single factor and the molecular mechanisms underlying these variations are not well understood.
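
To make the PWM model concrete, the following minimal Python sketch builds a generic mononucleotide log-odds matrix from aligned sites and scans a sequence against a score threshold. It is an illustration of the standard PWM idea only, not the refinement procedure of this study. The function names, the toy sites, the uniform background of 0.25, the pseudocount of 1, and the threshold of 5.0 are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Assumes uppercase A/C/G/T only and a uniform background frequency
    (both are simplifications for this sketch).
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        # Log2 ratio of the smoothed base frequency to the background.
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        window_score = score(pwm, window)
        if window_score >= threshold:
            hits.append((start, window, window_score))
    return hits


# Hypothetical toy data, not taken from the paper or any database.
toy_sites = ["TATAAA", "TATAAT", "TATATA", "TACAAA"]
toy_pwm = build_pwm(toy_sites)
hits = scan(toy_pwm, "GGCGTATAAAGGCC", threshold=5.0)
for start, window, window_score in hits:
    print(start, window, round(window_score, 2))
```

Lowering the threshold in this sketch admits more windows, which is one way to see why the score threshold affects the number of false positive hits.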
