Abstract
It is well known that DNA sequences contain a number of transcription factor (TF) binding sites, only some of which have been identified through biological experiments. However, these experiments are expensive and time-consuming. To overcome these problems, computational methods based on k-mer features or convolutional neural networks have been proposed to identify TF binding sites from DNA sequences. Although these methods perform well, they still lack the context information that relates to TF binding sites. Research indicates that standard recurrent neural networks (RNNs) and their variants perform better on time-series data than other models. In this study, we propose a model, named KEGRU, that identifies TF binding sites by combining a bidirectional gated recurrent unit (GRU) network with k-mer embedding. First, DNA sequences are divided into k-mer sequences with a specified length and stride window. Second, we treat each k-mer as a word and pre-train a word representation model with the word2vec algorithm. Third, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results show that our method performs better than some state-of-the-art methods. Additional experiments on the embedding strategy show that k-mer embedding helps to improve model performance. The robustness of KEGRU is demonstrated by experiments with different k-mer lengths, stride windows and embedding vector dimensions.
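The first step described in the abstract, dividing a DNA sequence into k-mers with a given length and stride window, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example sequences, and the values k=5 and stride=2 are all assumptions chosen for readability. The resulting lists of k-mers play the role of "sentences" that a word2vec implementation would be trained on; the integer encoding at the end is one plausible way to feed them to an embedding layer.

```python
def split_into_kmers(sequence, k, stride):
    """Slide a window of length k over the sequence, advancing by `stride`."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    # Windows that would run past the end of the sequence are dropped.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocabulary(kmer_sentences):
    """Map every distinct k-mer to an integer index, in first-seen order."""
    vocab = {}
    for sentence in kmer_sentences:
        for kmer in sentence:
            vocab.setdefault(kmer, len(vocab))
    return vocab


# Hypothetical toy sequences; real inputs would be much longer.
sequences = ["ACGTACGTAC", "TTGACGTCAA"]

# Each DNA sequence becomes a "sentence" whose "words" are k-mers.
sentences = [split_into_kmers(s, k=5, stride=2) for s in sequences]

# Integer-encode the k-mers so they can index an embedding table.
vocab = build_vocabulary(sentences)
encoded = [[vocab[kmer] for kmer in sentence] for sentence in sentences]

print(sentences[0])  # ['ACGTA', 'GTACG', 'ACGTA']
print(encoded)       # [[0, 1, 0], [2, 3, 4]]
```

A stride of 1 gives maximally overlapping k-mers, while a stride equal to k gives non-overlapping ones; the abstract reports that both k-mer length and stride were varied in the robustness experiments.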
Highlights
At the beginning of this study, many computational models used to describe transcription factor (TF) binding preference were proposed based on position weight matrices (PWMs) or motifs[12,13,14,15,16,17,18]
Each k-mer is treated as a word in a sentence, so DNA sequences are divided into k-mer series with a specified length and stride window
We propose a bidirectional gated recurrent unit neural network with k-mer embedding to identify TF binding sites from DNA sequences
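To make the idea of a bidirectional GRU over embedded k-mers concrete, the sketch below runs one GRU over the sequence from left to right and a second one from right to left, then concatenates the two hidden states at each position. It uses a common textbook GRU formulation in pure Python and is not the KEGRU implementation: the class and function names, the random weights, and the toy dimensions are assumptions, and the sign convention of the update gate differs between references.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """A single GRU cell with randomly initialised (untrained) weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One (W, U, b) triple per gate: update (z), reset (r), candidate (h).
        self.params = {
            gate: (random_matrix(hidden_dim, input_dim, rng),
                   random_matrix(hidden_dim, hidden_dim, rng),
                   [0.0] * hidden_dim)
            for gate in ("z", "r", "h")
        }

    def _affine(self, gate, x, h):
        W, U, b = self.params[gate]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h_prev):
        z = [sigmoid(v) for v in self._affine("z", x, h_prev)]
        r = [sigmoid(v) for v in self._affine("r", x, h_prev)]
        # The reset gate scales the previous state before the candidate is computed.
        gated = [ri * hi for ri, hi in zip(r, h_prev)]
        cand = [math.tanh(v) for v in self._affine("h", x, gated)]
        # The update gate interpolates between the old state and the candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def run(cell, inputs):
    """Return the hidden state after each input, starting from zeros."""
    h = [0.0] * cell.hidden_dim
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def bidirectional_gru(forward_cell, backward_cell, inputs):
    fwd = run(forward_cell, inputs)
    # Process the reversed sequence, then re-align states with positions.
    bwd = run(backward_cell, inputs[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


rng = random.Random(0)
embed_dim, hidden_dim, num_kmers = 4, 3, 5
# Stand-in for word2vec vectors of five consecutive k-mers.
embedded = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(num_kmers)]
fwd_cell = GRUCell(embed_dim, hidden_dim, rng)
bwd_cell = GRUCell(embed_dim, hidden_dim, rng)
outputs = bidirectional_gru(fwd_cell, bwd_cell, embedded)
print(len(outputs), len(outputs[0]))  # 5 6
```

Because the backward pass starts at the end of the sequence, the representation of the first k-mer already carries information about every later k-mer, which is the context that a position-local model does not see. In practice a deep learning framework's GRU layer would replace this hand-written cell.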
Summary
At the beginning of this study, many computational models used to describe TF binding preference were proposed based on position weight matrices (PWMs) or motifs[12,13,14,15,16,17,18]. Babak et al.[28] proposed a model based on deep convolutional neural networks (CNNs), named DeepBind, to predict the sequence specificities of DNA- and RNA-binding proteins. This model achieved better performance than other existing methods. These CNN-based models perform well, but we note that a CNN focuses only on the current state and cannot capture the influence of previous and future states on the current state. To address this problem, Quang et al.[65] proposed a hybrid convolutional and recurrent neural network framework for predicting the function of short DNA sequences. We hope that our method can contribute to the study of DNA sequence modeling and DNA regulatory mechanisms.