Sequence-based prediction of protein-binding sites in DNA: Comparative study of two SVM models

Byungkyu Park, Jinyong Im, Narankhuu Tuvshinjargal, Wook Lee, Kyungsook Han

doi:10.1016/j.cmpb.2014.07.009

Abstract

As many structures of protein–DNA complexes have become available in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One reason is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein–DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other predicts them using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3%, and an MCC of 0.324. The other SVM model, which uses both DNA and protein sequences, achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5%, and an MCC of 0.329. In both cross-validation and independent testing, the second SVM model, which used both DNA and protein sequence data, performed better than the first model, which used DNA sequence data alone.
To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone.
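
The abstract reports accuracy, F-measure, and MCC for each model. As a reading aid, the following minimal Python sketch shows how these three measures are commonly computed from the counts of a binary confusion matrix (binding vs. non-binding nucleotides). It assumes the standard textbook definitions; the abstract does not state the exact formulas the authors used (for example, whether the F-measure is averaged over classes), and the function name `binary_metrics` and the counts in the usage line are hypothetical, not data from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, F-measure and MCC from confusion-matrix counts.

    tp/fn: binding nucleotides predicted as binding / non-binding.
    tn/fp: non-binding nucleotides predicted as non-binding / binding.
    Standard definitions are assumed; they may differ from the paper's.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


# Hypothetical counts for illustration only (not taken from the paper).
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when binding and non-binding nucleotides are unevenly represented.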

Journal: Computer Methods and Programs in Biomedicine
Publication Date: Aug 1, 2014
Citations: 31
