Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data

Juhani Kähärä, Harri Lähdesmäki

doi:10.1186/1471-2105-14-s10-s2

Open Access

Abstract

Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built, and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented a standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. We compared the prediction performance of k-mer and position-specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the k-mer model performed on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1, the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1, the performance of the model was no better than chance.
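To illustrate the general idea behind a linear k-mer model, here is a minimal Python sketch. It is not the authors' implementation. It assumes that a sequence is scored as a weighted sum of its overlapping k-mer counts; the function names, the `intercept` parameter, and the toy weights are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_kmer_score(seq, weights, k, intercept=0.0):
    """Weighted sum of k-mer counts; k-mers without a weight contribute 0."""
    counts = kmer_counts(seq, k)
    return intercept + sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items())


# Hypothetical weights; in practice these would be fitted to training data.
weights = {"GAT": 1.5, "ATA": 1.0, "CCC": -0.5}
score = linear_kmer_score("AGATAA", weights, k=3)
```

In such a sketch, reducing the number of k-mers in the model corresponds to keeping fewer entries in `weights`.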

Highlights

Many proteins bind DNA in a sequence-specific way. These DNA-binding proteins include transcription factors (TF), among others, which have an important function in regulating gene expression by affecting transcription and chromatin state.
Position-specific weight matrices (PWMs) have been criticized because they may miss important dependencies between nearby nucleotides, but PWMs provide a very easy and intuitive modeling framework, and thousands of different PWMs exist in several databases [7,8].
Classification with PWMs was performed by scanning each read in the test set with the given model and assigning the maximum of the scores to that read.
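The PWM scanning step described in the last highlight can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the PWM is represented as a list of per-position score dictionaries, only the forward strand is scanned, and the names `pwm_window_score`, `max_pwm_score`, and `toy_pwm` are hypothetical.

```python
def pwm_window_score(pwm, window):
    """Sum the per-position scores of one window of the same width as the PWM."""
    return sum(pwm[i][base] for i, base in enumerate(window))


def max_pwm_score(pwm, read):
    """Slide the PWM along the read and return the best window score."""
    width = len(pwm)
    if len(read) < width:
        raise ValueError("read is shorter than the PWM")
    return max(
        pwm_window_score(pwm, read[i:i + width])
        for i in range(len(read) - width + 1)
    )


# Toy two-position PWM that favors the dinucleotide "AC" (illustrative values).
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
best = max_pwm_score(toy_pwm, "TTACG")
```

A read would then be classified by comparing `best` against a threshold; how that threshold is chosen is not specified here.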

Summary

Introduction

Many proteins bind DNA in a sequence-specific way. These DNA-binding proteins include transcription factors (TF), among others, which have an important function in regulating gene expression by affecting transcription and chromatin state. The DNA preference of DNA-binding proteins can be modeled with different computational methods [2]. All methods require known binding sites or data from biological experiments, such as gene expression profiling, chromatin immunoprecipitation followed by sequencing (ChIP-seq), protein. The number of each base is calculated in each position of the alignment, and each base is assigned a score based on the counts. This way each position treats the nucleotides independently of the other positions: the score is based only on the frequency of the base in that position. PWMs have been criticized because they may miss important dependencies between nearby nucleotides, but PWMs provide a very easy and intuitive modeling framework, and thousands of different PWMs exist in several databases [7,8].
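The counting step described above can be sketched in Python. The text says only that each base receives a score based on the counts, so the scoring rule used here is an assumption: log2 odds against a uniform background, with a pseudocount. The function names and the example sites are hypothetical.

```python
import math

BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each position of equally long, aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same length")
    counts = [{b: 0 for b in BASES} for _ in range(width)]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1
    return counts


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn counts into log2 odds scores, treating each position independently."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


# Hypothetical aligned binding sites, for illustration only.
sites = ["GATA", "GATA", "GATT", "CATA"]
counts = count_matrix(sites)
pwm = log_odds_pwm(counts)
```

Because each column is scored separately, the sketch makes the independence assumption explicit: no term in the score depends on more than one position.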

Methods

Results

Conclusion

Journal: BMC Bioinformatics	Publication Date: Aug 1, 2013
Citations: 21	License type: CC BY 2.0

Similar Papers

Stability selection for regression-based models of transcription factor–DNA binding specificity
Fantine Mordelet ... John Horton
Bioinformatics | VOL. 29
19 Jun 2013

Assessment of Algorithms for Inferring Positional Weight Matrix Motifs of Transcription Factor Binding Sites Using Protein Binding Microarray Data
Yaron Orenstein ... Ron Shamir
PLoS ONE | VOL. 7
28 Sep 2012

Transcription Factors and DNA Regulatory Elements
Martha L Bulyk
The FASEB Journal | VOL. 26
01 Apr 2012

A general approach for discriminative de novo motif discovery from high-throughput data
Jan Grau ... Jens Keilwagen
Nucleic Acids Research | VOL. 41
19 Sep 2013