Sequence-based prediction of putative transcription factor binding sites in DNA sequences of any length.

Wook Lee, Kyungsook Han, Byungkyu Park

doi:10.1109/tcbb.2017.2773075

Abstract

A transcription factor (TF) is a protein that regulates gene expression by binding to specific DNA sequences. Despite recent advances in experimental techniques for identifying transcription factor binding sites (TFBS) in DNA sequences, a large number of TFBS remain to be unveiled in many species. Several computational methods developed for predicting TFBS in DNA are tissue- or species-specific, and so cannot be used without prior knowledge of the tissue or species. Other computational methods are applicable only to finding TFBS in short DNA sequences. In this paper, we propose a new learning method for predicting TFBS in DNA of any length using the composition, transition and distribution of nucleotides and amino acids in DNA and TF sequences. In independent testing of the method on datasets that were not used in training, its accuracy and Matthews correlation coefficient (MCC) were as high as 81.84% and 0.634, respectively. The proposed method can be a useful aid for selecting potential TFBS from a large number of DNA sequences before conducting biochemical experiments to determine TFBS empirically. The program and data sets are available at http://bclab.inha.ac.kr/TFbinding.
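To make the three feature families named in the abstract concrete, the following is a minimal sketch of composition, transition and distribution descriptors for a DNA sequence. It is not the authors' implementation: the function names, the choice of unordered nucleotide pairs for transitions, the five distribution checkpoints (first, 25%, 50%, 75% and last occurrence) and the floor-based indexing are all assumptions made for illustration, and the analogous amino acid features for TF sequences are omitted.

```python
from itertools import combinations

ALPHABET = "ACGT"  # assumed alphabet; ambiguous bases such as N are ignored


def composition(seq):
    """Fraction of the sequence made up of each nucleotide."""
    n = len(seq)
    return {b: seq.count(b) / n for b in ALPHABET}


def transition(seq):
    """Fraction of adjacent positions that switch between two
    different nucleotides, counted in either order (assumption)."""
    pairs = list(zip(seq, seq[1:]))
    total = len(pairs)
    feats = {}
    for a, b in combinations(ALPHABET, 2):
        count = sum(1 for x, y in pairs if {x, y} == {a, b})
        feats[a + b] = count / total if total else 0.0
    return feats


def distribution(seq, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Relative position (1-based index / length) at which the first,
    25%, 50%, 75% and last occurrence of each nucleotide is found."""
    n = len(seq)
    feats = {}
    for b in ALPHABET:
        positions = [i + 1 for i, c in enumerate(seq) if c == b]
        if not positions:
            # Nucleotide absent: fill with zeros (assumption).
            feats[b] = [0.0] * len(fractions)
            continue
        values = []
        for f in fractions:
            # Floor-based rank of the occurrence, at least the first one.
            k = max(1, int(f * len(positions)))
            values.append(positions[k - 1] / n)
        feats[b] = values
    return feats


def ctd_vector(seq):
    """Concatenate the three descriptor groups into one flat list."""
    seq = seq.upper()
    comp = composition(seq)
    trans = transition(seq)
    dist = distribution(seq)
    vector = [comp[b] for b in ALPHABET]
    vector += [trans[k] for k in sorted(trans)]
    for b in ALPHABET:
        vector += dist[b]
    return vector
```

With these assumptions, `ctd_vector` always returns 30 values (4 composition, 6 transition, 20 distribution) regardless of sequence length, which is one way a fixed-size feature vector can be obtained from DNA of any length.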

Full Text