Abstract

Accurate RNA secondary structure information is the cornerstone of gene function research and RNA tertiary structure prediction. However, most traditional RNA secondary structure prediction algorithms are based on dynamic programming (DP) under minimum free energy theory, with both hard and soft constraints. Their accuracy depends heavily on the accuracy of the soft constraints, which come from experimental data such as chemical and enzymatic probing. As the RNA sequence lengthens, the time complexity of DP-based algorithms increases geometrically, so they cope poorly with relatively long sequences. Furthermore, because pseudoknot structures are complex, prediction methods based on traditional algorithms cannot predict secondary structures with pseudoknots well, and few algorithms have been available for pseudoknot prediction. The ATTfold algorithm proposed in this article is a deep learning algorithm based on an attention mechanism. It uses the attention mechanism to analyze the global information of the RNA sequence, focuses on the correlation between paired bases, and addresses the problem of long sequence prediction. Moreover, the algorithm extracts effective multi-dimensional features from a large number of RNA sequences and their structure information, and combines them with the hard constraints exclusive to RNA secondary structure. It can thus accurately determine the pairing position of each base and obtain a real and effective RNA secondary structure, including pseudoknots. Finally, after the ATTfold model was trained on tens of thousands of RNA sequences and their real secondary structures, it was compared with four classic RNA secondary structure prediction algorithms. The results show that our algorithm significantly outperforms the others and predicts RNA secondary structure more accurately. As the data in RNA sequence databases increase, the performance of our deep learning-based algorithm is expected to improve further. In the future, this kind of algorithm will become increasingly indispensable.

Highlights

  • RNA is an indispensable biopolymer that plays diverse biological roles in regulating translation (Kapranov et al., 2007), gene expression (Storz and Gottesman, 2006), and RNA splicing (Sharp, 2009).

  • The prediction of the RNA secondary structure has gradually fallen into a bottleneck in traditional algorithm research over the past 40 years




Introduction

RNA is an indispensable biopolymer that plays diverse biological roles in regulating translation (Kapranov et al., 2007), gene expression (Storz and Gottesman, 2006), and RNA splicing (Sharp, 2009). To accurately obtain the RNA secondary structure, different prediction algorithms have been developed over the past 40 years. The most mainstream calculation method is the Nearest Neighbor Thermodynamic Model (NNTM) based on a single RNA sequence (Turner and Mathews, 2010). This method calculates the RNA secondary structure with minimum free energy (MFE) through a dynamic programming algorithm. Classic algorithms focus only on the number of paired bases in the sequence and ignore the exact base pairs. Although such algorithms perform well in terms of prediction accuracy, they describe the true RNA secondary structures poorly. The thermodynamic matcher is still a very general framework used to handle the hard constraints of RNA secondary structure (Reeder and Giegerich, 2004).
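
To make the idea of a dynamic programming method that only counts paired bases concrete, here is a minimal Nussinov-style sketch in Python. It is not ATTfold and not the NNTM energy model; the names `CANONICAL`, `MIN_LOOP`, `can_pair`, and `max_base_pairs` are hypothetical, and the two hard constraints used (only A-U, G-C, and G-U pairs, with at least three unpaired bases inside a hairpin loop) are assumptions chosen for illustration.

```python
# Nussinov-style sketch: maximize the number of nested base pairs.
# Assumption: allowed pairs are A-U, G-C and the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in a hairpin loop


def can_pair(seq, i, j):
    """Hard-constraint check: allowed base types and no sharp hairpin."""
    return j - i > MIN_LOOP and (seq[i], seq[j]) in CANONICAL


def max_base_pairs(seq):
    """Return the maximum number of nested base pairs in seq."""
    n = len(seq)
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, splitting the interval
            for k in range(i, j - MIN_LOOP):
                if can_pair(seq, k, j):
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

The three nested loops give cubic running time in the sequence length, and the recursion only combines nested or side-by-side intervals, so crossing pairs (pseudoknots) cannot be represented in this formulation.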
