Prediction of Transcription Factor Binding Sites with Suffix Arrays

Myung Eun Lim, Jeong Seop Sim, Myung Geun Chung, Sun Hee Park

doi:10.11234/gi1990.14.400

Abstract

Suffix trees and suffix arrays are important index data structures in many applications in string processing and computational biology. Despite the simplicity of suffix arrays, suffix trees have been the more fundamental index data structure in the literature, because suffix arrays were inferior to suffix trees in two respects: construction time and search time. Recently, however, there has been vigorous work on suffix arrays, and they have been shown to be as powerful as suffix trees [1, 2, 3, 4]. The availability of whole human genome sequences from the Human Genome Project makes it possible to study gene function much more effectively. We can find or predict the functions and positions of genes by analyzing transcriptional regulation. In general, there are three issues in the field of transcriptional regulation: i) transcription factors (TFs), ii) transcription factor binding sites, and iii) regulatory proteins. In this poster, we focus on the second issue and propose an algorithm for predicting TF binding sites from a given set of upstream-region sequences of genes (transcriptional regulation regions). The algorithm relies on two assumptions. First, we assume that a common TF tends to bind to each of the upstream regions of functionally related genes. Second, we assume that a TF binds to similar DNA sequences in different genes or different organs.
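As a rough illustration of how a suffix array can support this kind of search, the following minimal sketch indexes a set of upstream sequences and reports the substrings of a fixed length that occur in every sequence. It is not the algorithm of this poster, whose details the abstract does not give: the function names, the `#` separator, the fixed length `k`, the exact-match criterion, and the naive construction are all assumptions made for illustration.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting suffixes directly; suitable only for
    short illustrative inputs, not for genome-scale data.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_kmers(seqs: List[str], k: int) -> List[str]:
    """Return, in sorted order, the k-mers that occur in every sequence."""
    # Concatenate the sequences and record which sequence owns each position.
    # "#" is a hypothetical separator assumed never to appear in the input.
    text = ""
    owner: List[int] = []
    for idx, s in enumerate(seqs):
        text += s + "#"
        owner.extend([idx] * (len(s) + 1))

    sa = build_suffix_array(text)

    result: List[str] = []
    current = None      # k-mer of the group being scanned
    seen = set()        # sequences in which the current k-mer occurs
    # Suffixes sharing the same first k characters are adjacent in the
    # suffix array, so one linear scan groups all occurrences of a k-mer.
    for pos in sa:
        kmer = text[pos:pos + k]
        if len(kmer) < k or "#" in kmer:
            continue    # too short, or spans a sequence boundary
        if kmer != current:
            if current is not None and len(seen) == len(seqs):
                result.append(current)
            current = kmer
            seen = set()
        seen.add(owner[pos])
    if current is not None and len(seen) == len(seqs):
        result.append(current)
    return result


if __name__ == "__main__":
    # Made-up toy sequences, not real upstream regions.
    upstream = ["ACGTTGACA", "GGTTGACC", "TTGACATT"]
    print(common_kmers(upstream, 5))
```

Exact matching is the simplest possible reading of the second assumption; since a TF is assumed to bind similar rather than identical sequences, a practical method would presumably need to tolerate mismatches, which this sketch does not attempt.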
