Abstract

Branchpoints (BPs) are essential sequence elements of ribonucleic acids (RNAs) in splicing, the process that creates a messenger RNA (mRNA) to be translated into proteins. This study develops deep neural networks for BP prediction. Extensive previous studies have shown that the existence of BP sites depends on sequence patterns called motifs; hence, a prediction model must be able to explain its decisions in terms of motifs. Existing approaches used either handcrafted features, which gave interpretable but less accurate predictions, or deep neural networks, which were accurate but difficult to explain. To address these difficulties, the proposed method incorporates 1) generative adversarial networks (GANs) to learn the latent structure of RNA sequences, and 2) an attention mechanism to learn sequence-positional long-term dependencies for accurate prediction and interpretation. Our method achieves strong results across various performance metrics with adequate interpretability. We demonstrate that, without any prior biological knowledge, BP prediction by the proposed method is closely related to three motifs that are well established in molecular biology: the consensus sequence surrounding BPs, the polypyrimidine tract, and the 3' splice site.

Highlights

  • The human body has numerous types of cells, such as blood cells, neurons, and liver cells

  • The associate editor coordinating the review of this manuscript and approving it for publication was Zijian Zhang. DNA is a long string of paired chemical molecules called nucleotides, which are of four types denoted by A, C, G, and T [1]

  • Recent efforts have enriched the lariats in RNA sequencing, building large-scale datasets of genome-wide BP annotations [4], [5]; however, only 40% of all human introns were covered by the annotation of almost 130,000 human BPs. In view of these difficulties, we propose a deep neural network model to predict the BP for a given RNA sequence, taking the following into consideration. First, BP sites are known to co-exist with motifs, that is, sequence patterns of up to tens of nucleotides that are typically not readable by humans and are difficult to identify; a few ultra-conservative sequence motifs were observed experimentally [6]

Summary

INTRODUCTION

The human body has numerous types of cells, such as blood cells, neurons, and liver cells. Existing approaches were constrained because they relied on either interpretable but inaccurate handcrafted sequential features or deep neural networks with poor interpretability [3], [8]–[12]. We address these issues by using generative adversarial networks (GANs) [13], [14] to learn the latent structure of RNA sequences for BP prediction; we call the resulting model BP-GAN hereafter. Encouraged by the success of attention, a number of recent studies in genomics have incorporated it into prediction models for RNA-protein binding sites [25], gene expression analysis [1], and precursor microRNAs [26]. These approaches primarily focused on attention at the level of hundreds of nucleotides [1], or used recurrent neural networks (RNNs) [26] to learn long-term dependencies between sequence elements [24].
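
To make the idea of attention over sequence positions concrete, the following is a minimal sketch of generic scaled dot-product self-attention applied to a one-hot encoded RNA string. It is not the BP-GAN architecture, which this summary does not specify. The example sequence, the random projection matrices, the dimension `d`, and the helper names `one_hot` and `self_attention` are all assumptions made for illustration.

```python
import numpy as np

NUCLEOTIDES = "ACGU"  # assumed RNA alphabet and encoding order


def one_hot(seq):
    """Encode an RNA string as an (L, 4) one-hot matrix."""
    index = {n: i for i, n in enumerate(NUCLEOTIDES)}
    x = np.zeros((len(seq), len(NUCLEOTIDES)))
    for pos, n in enumerate(seq):
        x[pos, index[n]] = 1.0
    return x


def self_attention(x, w_q, w_k, w_v):
    """Generic scaled dot-product self-attention over sequence positions."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


# Hypothetical setup: random (untrained) projections and a made-up sequence.
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(len(NUCLEOTIDES), d)) for _ in range(3))
seq = "CUAACUUUUCCCUUUAG"
x = one_hot(seq)
out, weights = self_attention(x, w_q, w_k, w_v)
```

Row `i` of `weights` is a distribution over all positions in the sequence, so in a trained model it can be inspected to see which nucleotides position `i` attends to. That is the general sense in which attention can support motif-level interpretation.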

SEQUENCE-BASED POSITIONAL SELF-ATTENTION
MOTIF INTERPRETATION FROM ATTENTION
TRAINING THE BP-GAN END-TO-END
RESULTS AND DISCUSSION
3) EVALUATION OF PREDICTION OF MULTIPLE BPs
CONCLUSIONS