Abstract

Obtaining valuable clues about noncoding RNA (ribonucleic acid) subsequences remains a significant challenge, given that most of the human genome is transcribed into noncoding RNA involved in as yet unknown biological operations. Capturing these clues relies on accurate “base pairing” prediction, also known as “RNA secondary structure prediction”. With COVID-19 considered a severe global threat, the single-stranded SARS-CoV-2 virus underscores the importance of establishing an efficient RNA analysis toolkit. This work aims to contribute to that goal by introducing a novel system for predicting RNA secondary structure patterns (specifically, RNA pseudoknots) that leverages syntactic pattern-recognition strategies. Focusing on pseudoknot prediction, we formalize RNA secondary structure prediction primarily as a parsing problem and secondarily as an optimization problem. The proposed methodology addresses the prediction of first-order (H-type) pseudoknots. We introduce a context-free grammar (CFG) with enough expressive power to recognize potential pseudoknot patterns. An alternative methodology for detecting possible pseudoknots, based on a brute-force algorithm, is also implemented. Any input sequence may exhibit multiple potential folding patterns, so a strict methodology is required to determine the single biologically realistic one. To resolve this ambiguity efficiently, we devised a novel heuristic on top of the widely accepted notion of free-energy minimization, which uses each pattern’s context to identify the most prominent pseudoknot pattern. The overall process has polynomial-time complexity, and its parallel implementation improves performance in proportion to the deployed hardware. The proposed methodology succeeds in predicting the core stems of the RNA pseudoknots in the test dataset, achieving a recall of 76.4%. It achieved an F1-score of 0.774 and an MCC of 0.543 in discovering all the stems of an RNA sequence, outperforming competing platforms on this particular task. Measurements were taken on a dataset of 262 RNA sequences, establishing a performance speed of 1.31, 3.45, and 7.76 relative to three well-known platforms. The implementation source code is publicly available in the Knotify GitHub repository.
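
To make the brute-force alternative mentioned in the abstract more concrete, the following is a minimal sketch of how candidate H-type pseudoknots could be enumerated. It is not the published implementation: the function names, the pairing table (Watson-Crick plus G-U wobble), the core stem length of 2, and the minimum loop length of 1 are all assumptions chosen for illustration. Two stems (i, j) and (k, l) form a candidate when i < k < j < l, so that the second stem crosses the first.

```python
# Hypothetical sketch: names, defaults, and pairing rules are assumptions,
# not the published implementation.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_stem(seq, left, right, length):
    """Return True if seq[left + t] pairs with seq[right - t] for every t < length."""
    return all((seq[left + t], seq[right - t]) in PAIRS for t in range(length))


def h_type_candidates(seq, stem_len=2, min_loop=1):
    """Enumerate (i, j, k, l) where stem (i, j) is crossed by stem (k, l).

    Layout along the sequence: S1a, loop, S2a, loop, S1b, loop, S2b,
    with each loop at least `min_loop` bases long (an assumption).
    """
    n = len(seq)
    found = []
    for i in range(n):
        # S2a must start after S1a plus the first loop.
        for k in range(i + stem_len + min_loop, n):
            # S1b must start after S2a plus the second loop; j is its last base.
            for j in range(k + 2 * stem_len + min_loop - 1, n):
                if not is_stem(seq, i, j, stem_len):
                    continue
                # S2b must start after S1b plus the third loop; l is its last base.
                for l in range(j + stem_len + min_loop, n):
                    if is_stem(seq, k, l, stem_len):
                        found.append((i, j, k, l))
    return found


def to_dot_bracket(n, i, j, k, l, stem_len=2):
    """Render one candidate in extended dot-bracket notation: () for S1, [] for S2."""
    out = ["."] * n
    for t in range(stem_len):
        out[i + t], out[j - t] = "(", ")"
        out[k + t], out[l - t] = "[", "]"
    return "".join(out)


if __name__ == "__main__":
    seq = "GGACCACCAGG"
    for cand in h_type_candidates(seq):
        print(cand, to_dot_bracket(len(seq), *cand))
```

For the toy sequence GGACCACCAGG, the candidate (0, 7, 3, 10) renders as ((.[[.)).]]. The four nested loops keep the search polynomial in the sequence length, but every candidate still has to be ranked afterwards, which is where an energy-based heuristic such as the one described above would come in.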

Highlights

  • The RNA molecule, being the intermediate representation of the information flowing from DNA to proteins, holds a crucial role in many biological processes

  • Non-coding RNAs are functional RNA molecules transcribed from DNA but not translated into proteins

  • The precision, recall, F1-score, and Matthews correlation coefficient (MCC) metrics per platform are reported for four groups of RNA sequences of different lengths. The tables show that, on average, our methodology outperformed all methods with regard to precision across all length ranges, while Knotty outperformed our methodology with regard to F1-score and MCC, mainly for longer RNA sequences
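
As an illustration of the metrics named in the last highlight, the sketch below computes precision, recall, F1-score, and MCC by comparing the base pairs of a predicted structure against a reference structure in extended dot-bracket notation. The function names are hypothetical, and the way true negatives are counted (all position pairs that are unpaired in both structures) is an assumption; the paper may count them differently.

```python
import math


def base_pairs(dot_bracket):
    """Extract the set of (open, close) index pairs from () and [] brackets.

    Assumes a well-formed, balanced structure string.
    """
    stacks = {"(": [], "[": []}
    closers = {")": "(", "]": "["}
    pairs = set()
    for idx, ch in enumerate(dot_bracket):
        if ch in stacks:
            stacks[ch].append(idx)
        elif ch in closers:
            pairs.add((stacks[closers[ch]].pop(), idx))
    return pairs


def scores(reference, predicted):
    """Return precision, recall, F1, and MCC over base pairs."""
    ref, pred = base_pairs(reference), base_pairs(predicted)
    n = len(reference)
    tp = len(ref & pred)   # pairs present in both structures
    fp = len(pred - ref)   # predicted pairs absent from the reference
    fn = len(ref - pred)   # reference pairs that were missed
    # Assumption: every other pair of positions counts as a true negative.
    tn = n * (n - 1) // 2 - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    print(scores("((.[[.)).]]", "((.[..)).]."))
```

In this toy comparison the prediction recovers two of the four reference pairs and adds one wrong pair, giving a precision of 2/3 and a recall of 1/2.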

Summary

Introduction

The RNA molecule, being the intermediate representation of the information flowing from DNA to proteins, holds a crucial role in many biological processes. Non-coding RNAs (ncRNAs) are functional RNA molecules transcribed from DNA but not translated into proteins. This must not be misinterpreted to mean that they carry no important information or contribute to no biological operation. Current evidence implies that most of the genome of mammals and other complex organisms is transcribed into ncRNAs, contradicting the widespread assumption that most genetic information is expressed as proteins. Their purpose is to fulfill diverse catalytic and structural functions, along with regulating gene expression at the transcriptional and post-transcriptional levels.
