Abstract

Precise identification of the target sites of RNA-binding proteins (RBPs) is important for understanding their biochemical and cellular functions. A large amount of experimental data has been generated by in vivo and in vitro approaches. The binding preferences determined from these platforms share similar patterns, but there are discernible differences between the datasets, and computational methods trained on one dataset do not always work well on another. To address this problem, which resembles the classic “domain shift” in deep learning, we adopted the adversarial domain adaptation (ADDA) technique and developed a framework (RBP-ADDA) that can extract RBP binding preferences from an integration of in vivo and in vitro datasets. Compared with conventional methods, ADDA has the advantage of working with two input datasets: it trains an initial neural network for each dataset individually, projects the two datasets onto a common feature space, and uses an adversarial framework to derive a network with the best discriminative predictive power. In the first step, for each RBP, we use only the in vitro data to pre-train a source network and a task predictor. Next, for the same RBP, we initialize the target network from the source network and use adversarial domain adaptation to update the target network with both in vitro and in vivo data. These two steps leverage the in vitro data to improve prediction on in vivo data, which is typically challenging because of its lower signal-to-noise ratio. Finally, to take further advantage of the fused source and target data, we fine-tune the task predictor using both datasets. We showed that RBP-ADDA achieved better performance in modeling in vivo RBP binding data than other existing methods, as judged by Pearson correlations. It also improved predictive performance on in vitro datasets.
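
The three training steps described above can be sketched on toy data. This is a minimal illustration, not the RBP-ADDA implementation: it assumes linear encoders, a logistic domain discriminator, hand-written gradients, and random feature vectors standing in for encoded RNA sequences. All function names (`pretrain_source`, `adapt_target`, `finetune_predictor`) and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_source(X_s, y_s, k=8, lr=0.05, steps=200):
    """Step 1: fit a source encoder W_s and a task predictor (v, b) on in vitro data only."""
    n, d = X_s.shape
    W_s = rng.normal(scale=0.1, size=(d, k))
    v = rng.normal(scale=0.1, size=k)
    b = 0.0
    for _ in range(steps):
        F = X_s @ W_s                       # encoded features
        err = F @ v + b - y_s               # residuals of the regression head
        grad_W = 2.0 / n * X_s.T @ np.outer(err, v)
        grad_v = 2.0 / n * F.T @ err
        W_s -= lr * grad_W
        v -= lr * grad_v
        b -= lr * 2.0 * err.mean()
    return W_s, v, b

def adapt_target(X_s, X_t, W_s, lr=0.02, steps=100):
    """Step 2: initialise the target encoder from W_s, then update it adversarially."""
    W_t = W_s.copy()                        # W_s itself stays frozen
    u, c = np.zeros(W_s.shape[1]), 0.0      # logistic domain discriminator
    for _ in range(steps):
        F_s, F_t = X_s @ W_s, X_t @ W_t
        # Discriminator step: source features labelled 1, target features labelled 0.
        g_s = (sigmoid(F_s @ u + c) - 1.0) / len(F_s)
        g_t = (sigmoid(F_t @ u + c) - 0.0) / len(F_t)
        u -= lr * (F_s.T @ g_s + F_t.T @ g_t)
        c -= lr * (g_s.sum() + g_t.sum())
        # Encoder step with inverted labels: push target features to look like source.
        g_inv = (sigmoid(F_t @ u + c) - 1.0) / len(F_t)
        W_t -= lr * X_t.T @ np.outer(g_inv, u)
    return W_t

def finetune_predictor(X_s, y_s, X_t, y_t, W_s, W_t, v, b, lr=0.05, steps=200):
    """Step 3: fine-tune the task predictor on encoded features from both domains."""
    F = np.vstack([X_s @ W_s, X_t @ W_t])
    y = np.concatenate([y_s, y_t])
    v = v.copy()
    for _ in range(steps):
        err = F @ v + b - y
        v -= lr * 2.0 / len(y) * F.T @ err
        b -= lr * 2.0 * err.mean()
    return v, b

# Toy data: the "in vivo" set is a shifted, noisier view of the same underlying signal.
d = 16
w_true = rng.normal(size=d) / np.sqrt(d)
X_vitro = rng.normal(size=(200, d))
y_vitro = X_vitro @ w_true + 0.1 * rng.normal(size=200)
X_vivo = rng.normal(loc=0.3, size=(100, d))
y_vivo = X_vivo @ w_true + 1.0 * rng.normal(size=100)

W_s, v, b = pretrain_source(X_vitro, y_vitro)
W_t = adapt_target(X_vitro, X_vivo, W_s)
v, b = finetune_predictor(X_vitro, y_vitro, X_vivo, y_vivo, W_s, W_t, v, b)

pred_vivo = X_vivo @ W_t @ v + b
r = np.corrcoef(pred_vivo, y_vivo)[0, 1]    # Pearson correlation on the toy target data
print(f"Pearson r on toy in vivo data: {r:.3f}")
```

Two design points carry over from the description: the source encoder is never updated after pre-training, and the target encoder is trained against the discriminator with inverted domain labels rather than by directly minimizing the task loss.
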
We further applied augmentation operations to RBPs with less in vivo data to expand the input data and showed that this can improve prediction performance. Lastly, we explored the interpretability of RBP-ADDA predictions: we quantified the contribution of the input features with Integrated Gradients and identified nucleotide positions that are important for RBP recognition.
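
To make the Integrated Gradients step concrete, the sketch below attributes the output of a toy scorer to individual nucleotide positions. It assumes a one-hot encoded sequence, an all-zeros baseline, and a simple sigmoid model with random weights in place of the trained network; the helper names and the example sequence are hypothetical.

```python
import numpy as np

ALPHABET = "ACGU"

def one_hot(seq):
    """Encode an RNA sequence as an (L, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, nt in enumerate(seq):
        x[i, ALPHABET.index(nt)] = 1.0
    return x

def model(x, w, b):
    """Toy differentiable scorer: sigmoid of a position-specific weight matrix."""
    return 1.0 / (1.0 + np.exp(-(np.sum(w * x) + b)))

def model_grad(x, w, b):
    """Analytic gradient of the toy scorer with respect to its input."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann approximation of the Integrated Gradients path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = [model_grad(baseline + a * (x - baseline), w, b) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

rng = np.random.default_rng(1)
seq = "GCAUGUAC"
w = rng.normal(size=(len(seq), 4))   # stand-in for learned parameters
b = -0.5
x, baseline = one_hot(seq), np.zeros((len(seq), 4))

attr = integrated_gradients(x, baseline, w, b)
per_position = attr.sum(axis=1)      # one importance score per nucleotide position
```

A useful sanity check is the completeness property: the attributions should sum, up to discretization error, to the difference between the model output at the input and at the baseline.
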

Highlights

  • RNA-binding proteins (RBPs) have important roles in all aspects of post-transcriptional gene regulation including splicing, polyadenylation, transport, translation, and degradation of RNA transcripts [1]

  • Because of the intrinsic differences between in vitro and in vivo experimental conditions, the binding preferences determined in vitro and in vivo do not always agree with each other. To solve this problem and make the best use of both types of data, we adopted the adversarial domain adaptation (ADDA) technique for the analysis of RNA-binding proteins and developed a framework (RBP-ADDA) that can extract RBP binding preferences from an integration of in vivo and in vitro datasets

  • We showed that RBP-ADDA outperforms other contemporary methods in predicting RBP binding preferences on both in vivo and in vitro data

Summary

Introduction

RNA-binding proteins (RBPs) have important roles in all aspects of post-transcriptional gene regulation, including splicing, polyadenylation, transport, translation, and degradation of RNA transcripts [1]. Several experimental and computational platforms have been developed over the years to determine and model the binding preferences between RBPs and RNAs [4]. In vitro methods such as RNAcompete incubate protein with synthesized RNA fragments (typically 30–41 nucleotides long) and determine the identity of bound RNA sequence motifs by sequencing or microarray [10–12]. With the success of these experimental approaches, several computational methods have been developed to help understand binding preferences from a structural and sequence perspective and to build accurate predictive models that infer the binding affinities of other RBPs [13–19]. The developers of GraphProt constructed a representative dataset by extending the binding sites determined by CLIP-seq by 150 nucleotides in both directions; this positive dataset has been widely used to train deep learning (DL) based models such as iDeepE [20]. Ghanbari and Ohler recently proposed a multi-task, multimodal deep neural network that infers RBP binding sites by considering the region types of the binding sites [19].
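
The 150-nucleotide extension around a CLIP-seq binding site can be illustrated with a small helper. This is a sketch only: the function name `extend_site`, the 0-based end-exclusive coordinates, and the clipping of the window at transcript ends are assumptions made for the example, not documented GraphProt behavior.

```python
def extend_site(transcript, start, end, flank=150):
    """Return the binding site [start, end) plus up to `flank` nt of context per side.

    Also returns the site's coordinates inside the returned window.
    Clipping at the transcript ends is an assumption of this sketch.
    """
    if not 0 <= start < end <= len(transcript):
        raise ValueError("binding site must lie inside the transcript")
    left = max(0, start - flank)
    right = min(len(transcript), end + flank)
    return transcript[left:right], start - left, end - left

transcript = "ACGU" * 100   # toy 400-nt transcript
window, site_start, site_end = extend_site(transcript, start=180, end=200)
print(len(window), site_start, site_end)
```

Returning the site coordinates relative to the window keeps track of where the original binding site sits within the extended context.
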
