Recurrent Neural Network for Predicting Transcription Factor Binding Sites

Zhen Shen,De-Shuang Huang,Wenzheng Bao

doi:10.1038/s41598-018-33321-1

https://doi.org/10.1038/s41598-018-33321-1

Journal: Scientific Reports
Publication Date: Oct 15, 2018
Citations: 176
License type: open-access

Affiliation: Tongji University

Abstract

DNA sequences contain many transcription factor (TF) binding sites, only some of which have been identified through biological experiments, and such experiments are expensive and time-consuming. To overcome these problems, computational methods based on k-mer features or convolutional neural networks have been proposed to identify TF binding sites from DNA sequences. Although these methods perform well, they do not capture the context information related to TF binding sites. Research indicates that standard recurrent neural networks (RNNs) and their variants perform better on time-series data than other models. In this study, we propose a model, named KEGRU, that identifies TF binding sites by combining a bidirectional gated recurrent unit (GRU) network with k-mer embedding. First, DNA sequences are divided into k-mer sequences with a specified length and stride window. Second, we treat each k-mer as a word and pre-train a word representation model with the word2vec algorithm. Third, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results show that our method outperforms several state-of-the-art methods. Additional experiments on the embedding strategy show that k-mer embedding helps improve model performance. Experiments with different k-mer lengths, stride windows, and embedding vector dimensions demonstrate the robustness of KEGRU.
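
The first step of the pipeline, dividing a DNA sequence into k-mers with a given length and stride window, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_into_kmers` and the example sequence are assumptions, and the handling of a trailing fragment shorter than k (dropped here) is a choice the abstract does not specify.

```python
def split_into_kmers(sequence, k, stride):
    """Slide a window of length k over the sequence, advancing `stride` bases per step.

    Assumption: a trailing fragment shorter than k is dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    sequence = sequence.upper()
    # The last valid window starts at len(sequence) - k.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


kmers = split_into_kmers("ACGTACGTAC", k=4, stride=2)
print(kmers)  # ['ACGT', 'GTAC', 'ACGT', 'GTAC']
```

With a stride smaller than k, consecutive k-mers overlap, so each base contributes to several "words" of the resulting sentence.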

Highlights

Early computational models describing transcription factor (TF) binding preference were based on position weight matrices (PWMs) or motifs[12,13,14,15,16,17,18]
Each k-mer is treated as a word in a sentence, so DNA sequences are divided into k-mer sequences with a specified length and stride window
We propose a bidirectional gated recurrent unit neural network with k-mer embedding to identify TF binding sites from DNA sequences
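
Treating k-mers as words implies a vocabulary that maps each distinct k-mer to an integer index, which an embedding table can then look up. The sketch below shows only that indexing step under stated assumptions: `build_vocab` and `encode` are hypothetical helper names, index 0 is reserved here for unseen k-mers, and the word2vec pre-training used in the paper is not shown.

```python
from collections import Counter


def build_vocab(kmer_sequences):
    """Map each distinct k-mer to an integer id; id 0 is reserved for unknown k-mers."""
    counts = Counter(kmer for seq in kmer_sequences for kmer in seq)
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda kmer: (-counts[kmer], kmer))
    return {kmer: idx for idx, kmer in enumerate(ordered, start=1)}


def encode(kmers, vocab):
    """Convert a k-mer sentence into the list of ids an embedding layer would consume."""
    return [vocab.get(kmer, 0) for kmer in kmers]


corpus = [["ACG", "CGT", "GTA"], ["ACG", "CGA"]]
vocab = build_vocab(corpus)
print(encode(["ACG", "GTA", "TTT"], vocab))  # [1, 4, 0]
```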

Summary

Introduction

Early computational models of TF binding preference were based on position weight matrices (PWMs) or motifs[12,13,14,15,16,17,18]. Alipanahi et al.[28] proposed DeepBind, a model based on deep convolutional neural networks (CNNs), to predict the sequence specificities of DNA- and RNA-binding proteins. This model achieved better performance than other existing methods. Although CNN-based models perform well, a CNN focuses only on the current state and cannot capture the influence of previous and future states on the current state. To address this problem, Quang et al.[65] proposed a hybrid convolutional and recurrent neural network framework for predicting the function of short DNA sequences. We hope that our method will contribute to the study of DNA sequence modeling and DNA regulatory mechanisms.
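
To make concrete how a bidirectional GRU lets previous and future states influence the current one, the commonly used GRU update is written out below. This is the standard textbook formulation, not necessarily the exact form used in the paper, which this summary does not give. The symbols are assumptions for illustration: x_t is the embedding of the t-th k-mer, h_t the hidden state, W, U, and b are learned parameters, sigma is the logistic sigmoid, and the circled dot is the element-wise product. Some references swap the roles of z_t and 1 - z_t in the last update.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t\right]
\end{aligned}
```

The forward state is computed from left to right using h_{t-1}, and the backward state applies the same recurrence from right to left using h_{t+1} with its own parameters. Their concatenation at position t therefore depends on k-mers both before and after it.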

Methods

Results

Conclusion