Abstract

Accurate predictions of RNA secondary structures can help uncover the roles of functional non-coding RNAs. Although machine learning-based models have achieved high performance in terms of prediction accuracy, overfitting is a common risk for such highly parameterized models. Here we show that overfitting can be minimized when RNA folding scores learnt using a deep neural network are integrated together with Turner’s nearest-neighbor free energy parameters. Training the model with thermodynamic regularization ensures that folding scores and the calculated free energy are as close as possible. In computational experiments designed for newly discovered non-coding RNAs, our algorithm (MXfold2) achieves the most robust and accurate predictions of RNA secondary structures without sacrificing computational efficiency compared to several other algorithms. The results suggest that integrating thermodynamic information could help improve the robustness of deep learning-based predictions of RNA secondary structure.
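The idea of thermodynamic regularization described above can be illustrated with a minimal sketch. It assumes the regularizer is a squared difference between the learnt folding score and a score derived from the calculated free energy, both expressed on the same scale and sign convention. The function names, the `weight` parameter, and the squared-difference form are hypothetical and are not taken from the MXfold2 implementation.

```python
def thermodynamic_penalty(folding_score, free_energy_score, weight=1.0):
    """Penalize disagreement between the learnt score and the thermodynamic score.

    Assumption: both inputs use the same scale and sign convention, so that
    a perfect match gives a penalty of zero.
    """
    return weight * (folding_score - free_energy_score) ** 2


def regularized_loss(base_loss, folding_score, free_energy_score, weight=1.0):
    """Add the thermodynamic penalty to an arbitrary training loss (hypothetical)."""
    return base_loss + thermodynamic_penalty(folding_score, free_energy_score, weight)


if __name__ == "__main__":
    # The penalty grows as the two scores drift apart.
    print(regularized_loss(base_loss=1.0, folding_score=5.0,
                           free_energy_score=3.0, weight=0.5))
```

In this sketch, a larger `weight` pulls the learnt scores more strongly toward the thermodynamic values, which is one way to limit how far a highly parameterized model can drift from the nearest-neighbor model.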

Highlights

  • Accurate predictions of RNA secondary structures can help uncover the roles of functional non-coding RNAs

  • Inspired by MXfold and deep neural network (DNN)-based RNA secondary structure prediction methods, we propose an algorithm for predicting RNA secondary structures using deep learning

  • These results indicate that MXfold2 (F = 0.693) achieved the best accuracy, followed by the trainable methods, namely MXfold (F = 0.673), TORNADO (F = 0.664 at γ = 4.0), CONTRAfold (F = 0.658 at γ = 4.0), and ContextFold (F = 0.651), and that MXfold2 outperformed the thermodynamics-based methods (p < 0.001, one-sided Wilcoxon signed-rank test)
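The F values quoted above are accuracy scores over predicted base pairs. As a rough illustration, the sketch below computes an F-value as the harmonic mean of positive predictive value (PPV) and sensitivity, assuming structures are represented as sets of `(i, j)` base-pair index tuples. The function name and the toy data are hypothetical, and the exact evaluation protocol of the paper may differ.

```python
def f_measure(predicted_pairs, reference_pairs):
    """Harmonic mean of PPV and sensitivity over base pairs.

    Assumption: each structure is an iterable of (i, j) index tuples.
    """
    predicted = set(predicted_pairs)
    reference = set(reference_pairs)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or references
    ppv = true_positives / len(predicted)          # fraction of predicted pairs that are correct
    sensitivity = true_positives / len(reference)  # fraction of reference pairs recovered
    return 2 * ppv * sensitivity / (ppv + sensitivity)


if __name__ == "__main__":
    predicted = [(1, 10), (2, 9), (3, 8)]
    reference = [(1, 10), (2, 9), (4, 7), (5, 6)]
    print(f_measure(predicted, reference))
```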


Summary

Introduction

Accurate predictions of RNA secondary structures can help uncover the roles of functional non-coding RNAs. An alternative approach uses machine learning techniques, which train scoring parameters for decomposed substructures from reference structures rather than determining them experimentally. This approach has been successfully adopted by CONTRAfold[12,13]. Rich parameterization can cause overfitting to the training data, preventing robust predictions for a wide range of RNA sequences[15]. Probabilistic generative models such as stochastic context-free grammars (SCFGs) have also been applied to predicting RNA secondary structures. TORNADO[15] implemented an application of SCFGs to the nearest-neighbor model, achieving performance comparable with that of other machine learning-based methods.
