Base-resolution prediction of transcription factor binding signals by a deep learning framework.

Qinhu Zhang, Zhen Cui, Qi Liu, Zhanheng Chen, Zhenhao Guo, Siguo Wang, De-Shuang Huang, Ying He, Inna Lavrik

doi:10.1371/journal.pcbi.1009941

Abstract

Transcription factors (TFs) play an important role in regulating gene expression, so identifying the sites they bind has become a fundamental step in molecular and cellular biology. In this paper, we develop FCNsignal, a deep learning framework that builds on existing fully convolutional neural networks (FCN) to predict TF-DNA binding signals at base resolution. FCNsignal can simultaneously perform the following tasks: (i) modeling the base-resolution signals of binding regions; (ii) discriminating binding from non-binding regions; (iii) locating TF-DNA binding regions; (iv) predicting binding motifs. FCNsignal can also be used to predict open (accessible) chromatin regions across the whole genome. Experimental results on 53 TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that the proposed framework outperforms several existing state-of-the-art methods. In addition, we used the trained FCNsignal to locate all potential TF-DNA binding regions on a whole chromosome and to make predictions for DNA sequences of arbitrary length; the results show that the framework finds most of the known binding regions and accepts sequences of arbitrary length. Furthermore, through a series of experiments, we demonstrate the potential of the framework for discovering causal disease-associated single-nucleotide polymorphisms (SNPs).
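The abstract's claim that a fully convolutional model accepts sequences of arbitrary length follows from the architecture: a convolution applies the same weights at every position, so the output simply has as many positions as the input. The sketch below illustrates only that property. It is not the FCNsignal architecture; the single hand-set filter `TOY_KERNEL` and the helper names `one_hot`, `conv1d_same`, and `predict_signal` are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: a single hand-set convolutional filter,
# NOT the trained FCNsignal model. All names here are hypothetical.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_same(x, kernel, bias=0.0):
    """Slide a (width x 4) kernel over x with zero padding.

    Padding is chosen so the output has exactly one value per input base.
    """
    width = len(kernel)
    left = (width - 1) // 2  # number of padded positions on the left
    out = []
    for i in range(len(x)):
        total = bias
        for k in range(width):
            j = i + k - left
            if 0 <= j < len(x):  # positions outside the sequence count as zeros
                total += sum(w * v for w, v in zip(kernel[k], x[j]))
        out.append(total)
    return out


def relu(values):
    """Clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


def predict_signal(seq, kernel):
    """Return one non-negative score per base of seq."""
    return relu(conv1d_same(one_hot(seq), kernel))


# Hypothetical filter that responds to the 3-mer "TGA", centred on the G.
TOY_KERNEL = [
    [0.0, 0.0, 0.0, 1.0],  # T
    [0.0, 0.0, 1.0, 0.0],  # G
    [1.0, 0.0, 0.0, 0.0],  # A
]

if __name__ == "__main__":
    for seq in ("ACTGAC", "ACTGACTTGACCA"):
        signal = predict_signal(seq, TOY_KERNEL)
        print(len(seq), len(signal), signal)
```

Because no layer depends on a fixed input size, the same filter scores a 6-base and a 13-base sequence without any change. A real model would stack many learned filters and further layers, but the length-preserving behaviour rests on the same idea.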

Highlights

Transcription factors (TFs) can activate or suppress the transcription of genes by binding to specific DNA non-coding regions, thereby playing an integral role in gene expression [1].
With the development of high-throughput sequencing technologies and deep learning (DL), several DL-based approaches have been developed for systematically studying transcription factor binding sites (TFBSs), achieving impressive performance.
We provide an integrated framework that uses the fully convolutional neural network (FCN) architecture to predict TF-DNA binding signals at base resolution, so that multiple TFBS-associated tasks can be studied simultaneously.

Summary

Introduction

Transcription factors (TFs) can activate or suppress the transcription of genes by binding to specific DNA non-coding regions, thereby playing an integral role in gene expression [1]. SMiLE-seq [11] is a newly developed technology for protein–DNA interaction characterization that can efficiently characterize the DNA binding specificities of TF monomers, homodimers and heterodimers. These binding data provide an unprecedented opportunity to develop computational approaches for predicting TFBSs and motifs. Gkm-SVM [14,15] detected functional regulatory elements in DNA sequences by using gapped k-mers and a support vector machine. These methods often suffer from low efficiency and poor performance.
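To make the gapped k-mer idea mentioned above concrete, the following sketch counts gapped k-mer features: each window of length `l` contributes every pattern that keeps `k` of its bases and masks the rest. This is a simplified illustration, not the gkm-SVM implementation. The function names, the default values `l=4` and `k=3`, and the unnormalized dot-product similarity are assumptions made for the example.

```python
# Simplified illustration of gapped k-mer features.
# Not the gkm-SVM software; names and defaults are hypothetical.
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=4, k=3, gap="-"):
    """Count gapped k-mers: length-l windows with k kept bases and l-k gaps."""
    counts = Counter()
    seq = seq.upper()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        for keep in combinations(range(l), k):
            pattern = "".join(
                window[i] if i in keep else gap for i in range(l)
            )
            counts[pattern] += 1
    return counts


def gkm_similarity(a, b, l=4, k=3):
    """Unnormalized similarity: dot product of the two count vectors."""
    ca = gapped_kmer_counts(a, l, k)
    cb = gapped_kmer_counts(b, l, k)
    return sum(ca[p] * cb[p] for p in ca.keys() & cb.keys())


if __name__ == "__main__":
    print(sorted(gapped_kmer_counts("ACGT").items()))
    print(gkm_similarity("ACGT", "ACGA"))
```

A similarity of this kind could serve as the kernel of a support vector machine. Tolerating gaps lets two sequences that differ at a single position still share features, which exact k-mer counts would miss.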

Methods

Results

Discussion

Conclusion