LncMiRPath: A Transformer-based Deep Learning Framework for lncRNAs and miRNAs Interaction Prediction
Introduction/Objective: Interactions between long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) play a critical role in gene regulation and disease mechanisms. However, most existing prediction models rely solely on sequence features, overlooking RNA secondary structures that are essential for accurate interaction prediction. This study introduces LncMiRPath, a Transformer-based framework that integrates both sequence and structural information to enhance predictive performance. Methods: We developed LncMiRPath using a dual-input Transformer architecture that incorporates lncRNA and miRNA sequences alongside their predicted secondary structures. Datasets were obtained from LncBase v3, ENCORI, and miRcode. Secondary structures were inferred using IPknot and represented in dot-bracket notation. We compared three model variants—sequence-only, structure-only, and combined models—using accuracy, precision, recall, and area under the curve (AUC) as performance metrics. Results: LncMiRPath outperformed all baseline models, achieving an AUC of 95% on the curated dataset, demonstrating the effectiveness of integrating structural features. On the independent LncRNASNPv2 dataset, the model maintained strong generalization capability with an AUC of ~91%. Discussion: These results underscore the importance of incorporating RNA secondary structure, a factor often neglected in previous studies. By capturing complementary sequence and structural signals, LncMiRPath not only improves prediction accuracy but also enhances biological interpretability. Although structure inference relies on computational tools such as IPknot, consistent performance across multiple datasets supports the robustness and translational potential of the proposed approach. Future validation with experimental structure data may further strengthen the model. 
Conclusion: LncMiRPath represents a robust and biologically informed framework for predicting lncRNA–miRNA interactions by jointly leveraging sequence and structural features. This approach advances RNA computational biology and provides a promising tool for RNA-based therapeutic research.
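As a rough illustration of the structure input described in the Methods, a dot-bracket string can be decoded into a base-pair table using one stack per bracket type. This is a minimal sketch, not LncMiRPath's own code: the function name `dot_bracket_pairs` and the set of accepted bracket types are assumptions, chosen so that extra bracket types (such as the square brackets that pseudoknot-aware predictors may emit) are handled as well.

```python
def dot_bracket_pairs(structure):
    """Return a list where entry i is the partner index of position i, or -1 if unpaired."""
    # Hypothetical bracket alphabet; each closing symbol maps to its opening symbol.
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    stacks = {opener: [] for opener in closers.values()}
    partner = [-1] * len(structure)

    for i, ch in enumerate(structure):
        if ch in stacks:
            # Opening bracket: remember its position until the matching close arrives.
            stacks[ch].append(i)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {i}")
            j = stack.pop()
            partner[i], partner[j] = j, i
        # Any other character (typically ".") is treated as unpaired.

    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return partner


if __name__ == "__main__":
    print(dot_bracket_pairs("((..))"))
    print(dot_bracket_pairs("(([..))]"))
```

Keeping a separate stack per bracket type lets crossing pairs be decoded, which a single shared stack would not allow.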
- Research Article
53
- 10.1006/jmbi.1995.0608
- Nov 1, 1995
- Journal of Molecular Biology
An Interactive Framework for RNA Secondary Structure Prediction with a Dynamical Treatment of Constraints
- Research Article
33
- 10.1186/s12864-020-07239-w
- Dec 1, 2020
- BMC Genomics
Background: RNA-binding proteins (RBPs) play a vital role in post-transcriptional processes in all eukaryotes, such as splicing regulation, mRNA transport, and modulation of mRNA translation and decay. The identification of RBP binding sites is a crucial step in understanding the biological mechanism of post-transcriptional gene regulation. However, determining RBP binding sites on a large scale is challenging because of the high cost of biochemical assays. Many studies have exploited machine learning methods to predict binding sites. In particular, deep learning is increasingly used in bioinformatics by virtue of its ability to learn generalized representations from DNA and protein sequences. Results: In this paper, we implemented a novel deep neural network model, DeepRKE, which combines primary RNA sequence and secondary structure information to effectively predict RBP binding sites. Specifically, we used a word embedding algorithm to extract features of RNA sequences and secondary structures, i.e., a distributed representation of k-mer sequences rather than traditional one-hot encoding. The distributed representations are taken as input to convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) to identify RBP binding sites. Our results show that DeepRKE outperforms existing counterpart methods on two large-scale benchmark datasets. Conclusions: Our extensive experimental results show that DeepRKE is an efficacious tool for predicting RBP binding sites. The distributed representations of RNA sequences and secondary structures can effectively detect the latent relationship and similarity between k-mers, and thus improve the predictive performance. The source code of DeepRKE is available at https://github.com/youzhiliu/DeepRKE/.
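The k-mer step that precedes a word embedding can be sketched as follows. This is only an illustration under stated assumptions: the helper names `kmer_tokens` and `build_vocab`, the choice of k = 3, and the stride of 1 are hypothetical and are not taken from the DeepRKE source. The integer ids produced here are what an embedding layer would then map to dense vectors.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mers (k and stride are illustrative defaults)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(token_lists):
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


if __name__ == "__main__":
    tokens = kmer_tokens("AUGGCUA", k=3)
    vocab = build_vocab([tokens])
    print(tokens)
    print([vocab[t] for t in tokens])
```

The same tokenizer can be applied to a dot-bracket string, so that sequence and structure are both represented as k-mer "words".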
- Research Article
26
- 10.1186/s12864-018-5275-8
- Dec 1, 2018
- BMC Genomics
Background: With the increasing number of annotated long noncoding RNAs (lncRNAs) from the genome, researchers are continually updating their understanding of lncRNAs. Recently, thousands of lncRNAs have been reported to be associated with ribosomes in mammals. However, their biological functions or mechanisms are still unclear. Results: In this study, we tried to investigate the sequence features involved in the ribosomal association of lncRNA. We have extracted ninety-nine sequence features corresponding to different biological mechanisms (i.e., RNA splicing, putative ORF, k-mer frequency, RNA modification, RNA secondary structure, and repeat element). An L1-regularized logistic regression model was applied to screen these features. Finally, we obtained fifteen and nine important features for the ribosomal association of human and mouse lncRNAs, respectively. Conclusion: To our knowledge, this is the first study to characterize ribosome-associated lncRNAs and ribosome-free lncRNAs from the perspective of sequence features. These sequence features that were identified in this study may shed light on the biological mechanism of the ribosomal association and provide important clues for functional analysis of lncRNAs.
- Research Article
12
- 10.1186/s12864-018-4497-0
- Feb 15, 2018
- BMC Genomics
Background: RNA is known to play diverse roles in gene regulation. The clues for this regulatory function of RNA are embedded in its ability to fold into intricate secondary and tertiary structure. Results: We report the transcriptome-wide RNA secondary structure in zebrafish at single nucleotide resolution using Parallel Analysis of RNA Structure (PARS). This study provides the secondary structure map of zebrafish coding and non-coding RNAs. The single nucleotide pairing probabilities of 54,083 distinct transcripts in the zebrafish genome were documented. We identified RNA secondary structural features embedded in functional units of zebrafish mRNAs. Translation start and stop sites were demarcated by weak structural signals. The coding regions were characterized by the three-nucleotide periodicity of secondary structure and display a codon base specific structural constraint. The splice sites of transcripts were also delineated by distinct signature signals. Relatively higher structural signals were observed at 3′ Untranslated Regions (UTRs) compared to Coding DNA Sequence (CDS) and 5′ UTRs. The 3′ ends of transcripts were also marked by unique structure signals. Secondary structural signals in long non-coding RNAs were also explored to better understand their molecular function. Conclusions: Our study presents the first PARS-enabled transcriptome-wide secondary structure map of zebrafish, which documents pairing probability of RNA at single nucleotide precision. Our findings open avenues for exploring structural features in zebrafish RNAs and their influence on gene expression.
- Research Article
- 10.18454/jbg.2020.1.13.1
- Jan 23, 2020
- SHILAP Revista de lepidopterología
The structure of noncoding RNAs largely determines their functions. With the rapid growth of experimental data on RNA secondary structures, predicting RNA spatial structure has become the most urgent task of RNA bioinformatics. The ability to predict tertiary base pairs from data on the secondary structure could significantly reduce the running time and improve the quality of RNA spatial structure prediction algorithms. In this work, we applied a machine learning algorithm to the problem of predicting RNA tertiary base pairs from data on the RNA sequence and secondary structure. A group of local base pairs was identified that can be predicted with high quality (80% precision, 80% recall). It was also shown that more than 70% of all long-range noncanonical base pairs in RNA are base pairs of the geometric classes Sugar-Edge/Sugar-Edge and Sugar-Edge/Watson-Crick-Edge, which correspond to ribose zipper and A-minor tertiary motifs.
- Research Article
43
- 10.1186/1471-2105-11-s6-s21
- Oct 1, 2010
- BMC Bioinformatics
Background: Determining the secondary structure of RNA from the primary structure is a challenging computational problem. A number of algorithms have been developed to predict the secondary structure from the primary structure. It is agreed that there is still room for improvement in each of these approaches. In this work we build a predictive model for secondary RNA structure using a graph-theoretic tree representation of secondary RNA structure. We model the bonding of two RNA secondary structures to form a larger secondary structure with a graph operation we call merge. We consider all combinatorial possibilities using all possible tree inputs, both those that are RNA-like in structure and those that are not. The resulting data from each tree merge operation is represented by a vector. We use these vectors as input values for a neural network and train the network to recognize a tree as RNA-like or not, based on the merge data vector. The network estimates the probability of a tree being RNA-like. Results: The network correctly assigned a high probability of RNA-likeness to trees previously identified as RNA-like and a low probability of RNA-likeness to those classified as not RNA-like. We then used the neural network to predict the RNA-likeness of the unclassified trees. Conclusions: There are a number of secondary RNA structure prediction algorithms available online. These programs are based on finding the secondary structure with the lowest total free energy. In this work, we create a predictive tool for secondary RNA structures using graph-theoretic values as input for a neural network. The use of a graph operation to theoretically describe the bonding of secondary RNA is novel and is an entirely different approach to the prediction of secondary RNA structures. Our method correctly predicted trees to be RNA-like or not RNA-like for all known cases. In addition, our results convey a measure of likelihood that a tree is RNA-like or not RNA-like.
Given that the majority of secondary RNA folding algorithms return more than one possible outcome, our method provides a means of determining the best or most likely structures among all of the possible outcomes.
- Research Article
44
- 10.1128/jvi.00701-20
- Nov 23, 2020
- Journal of Virology
Chikungunya virus (CHIKV) is a mosquito-borne alphavirus associated with debilitating arthralgia in humans. RNA secondary structure in the viral genome plays an important role in the lifecycle of alphaviruses; however, the specific role of RNA structure in regulating CHIKV replication is poorly understood. Our previous studies found little conservation in RNA secondary structure between alphaviruses, and this structural divergence creates unique functional structures in specific alphavirus genomes. Therefore, to understand the impact of RNA structure on CHIKV biology, we used SHAPE-MaP to inform the modeling of RNA secondary structure throughout the genome of a CHIKV isolate from the 2013 Caribbean outbreak. We then analyzed regions of the genome with high levels of structural specificity to identify potentially functional RNA secondary structures and identified 23 regions within the CHIKV genome with higher than average structural stability, including four previously identified, functionally important CHIKV RNA structures. We also analyzed the RNA flexibility and secondary structures of multiple 3'UTR variants of CHIKV that are known to affect virus replication in mosquito cells. This analysis found several novel RNA structures within these 3'UTR variants. A duplication in the 3'UTR that enhances viral replication in mosquito cells led to an overall increase in the amount of unstructured RNA in the 3'UTR. This analysis demonstrates that the CHIKV genome contains a number of unique, specific RNA secondary structures and provides a strategy for testing these secondary structures for functional importance in CHIKV replication and pathogenesis. Importance: Chikungunya virus (CHIKV) is a mosquito-borne RNA virus that causes febrile illness and debilitating arthralgia in humans. CHIKV causes explosive outbreaks but there are no approved therapies to treat or prevent CHIKV infection.
The CHIKV genome contains functional RNA secondary structures that are essential for proper virus replication. Since RNA secondary structures have only been defined for a small portion of the CHIKV genome, we used a chemical probing method to define the RNA secondary structures of CHIKV genomic RNA. We identified 23 highly specific structured regions of the genome, and confirmed the functional importance of one structure using mutagenesis. Furthermore, we defined the RNA secondary structure of three CHIKV 3'UTR variants that differ in their ability to replicate in mosquito cells. Our study highlights the complexity of the CHIKV genome and describes new systems for designing compensatory mutations to test the functional relevance of viral RNA secondary structures.
- Research Article
9
- 10.2174/1574893615999200724145835
- Feb 1, 2021
- Current Bioinformatics
Background: Epigenetic repression mechanisms play an important role in gene regulation, specifically in cancer development. In many cases, a CpG island’s (CGI) susceptibility or resistance to methylation is shown to be contributed by local DNA sequence features. Objective: To develop unbiased machine learning models–individually and combined for different biological features–that predict the methylation propensity of a CGI. Methods: We developed our model consisting of CGI sequence features on a dataset of 75 sequences (28 prone, 47 resistant) representing a genome-wide methylation structure. We tested our model on two independent datasets that are chromosome (132 sequences) and disease (70 sequences) specific. Results: We provided improvements in prediction accuracy over previous models. Our results indicate that combined features better predict the methylation propensity of a CGI (area under the curve (AUC) ~0.81). Our global methylation classifier performs well on independent datasets reaching an AUC of ~0.82 for the complete model and an AUC of ~0.88 for the model using select sequences that better represent their classes in the training set. We report certain de novo motifs and transcription factor binding site (TFBS) motifs that are consistently better in separating prone and resistant CGIs. Conclusion: Predictive models for the methylation propensity of CGIs lead to a better understanding of disease mechanisms and can be used to classify genes based on their tendency to contain methylation prone CGIs, which may lead to preventative treatment strategies. MATLAB® and Python™ scripts used for model building, prediction, and downstream analyses are available at https://github.com/dicleyalcin/methylProp_predictor.
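For readers unfamiliar with the AUC values quoted above, the metric can be computed directly from its pairwise definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below is a generic pure-Python illustration, not the authors' MATLAB or Python scripts; the function name `roc_auc` and the 0/1 label encoding are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0 (negative) or 1 (positive); scores: predicted scores.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The quadratic loop is acceptable for datasets of the size described here (tens to hundreds of sequences); a rank-based formulation would be preferable for large inputs.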
- Research Article
5
- 10.1021/bi3001227
- Jun 21, 2012
- Biochemistry
To better elucidate RNA structure-function relationships and to improve the design of pharmaceutical agents that target specific RNA motifs, an understanding of RNA primary, secondary, and tertiary structure is necessary. The prediction of RNA secondary structure from sequence is an intermediate step in predicting RNA three-dimensional structure. RNA secondary structure is typically predicted using a nearest neighbor model based on free energy parameters. The current free energy parameters for 2 × 3 nucleotide loops are based on a 23-member data set of 2 × 3 loops and internal loops of other sizes. A database of representative RNA secondary structures was searched to identify 2 × 3 nucleotide loops that occur in nature. Seventeen of the most frequent 2 × 3 nucleotide loops in this database were studied by optical melting experiments. Fifteen of these loops melted in a two-state manner, and the associated experimental ΔG°(37,2×3) values are, on average, 0.6 and 0.7 kcal/mol different from the values predicted for these internal loops using the predictive models proposed by Lu, Turner, and Mathews [Lu, Z. J., Turner, D. H., and Mathews, D. H. (2006) Nucleic Acids Res. 34, 4912-4924] and Chen and Turner [Chen, G., and Turner, D. H. (2006) Biochemistry 45, 4025-4043], respectively. These new ΔG°(37,2×3) values can be used to update the current algorithms that predict secondary structure from sequence. To improve free energy calculations for duplexes containing 2 × 3 nucleotide loops that still do not have experimentally determined free energy contributions, an updated predictive model was derived. This new model resulted from a linear regression analysis of the data reported here combined with 31 previously studied 2 × 3 nucleotide internal loops. 
Most of the values for the parameters in this new predictive model are within experimental error of those of the previous models, suggesting that approximations and assumptions associated with the derivation of the previous nearest neighbor parameters were valid. The updated predictive model predicts free energies of 2 × 3 nucleotide internal loops within 0.4 kcal/mol, on average, of the experimental free energy values. Both the experimental values and the updated predictive model can be used to improve secondary structure prediction from sequence.
- Research Article
70
- 10.1186/1471-2105-8-33
- Jan 30, 2007
- BMC Bioinformatics
Background: Accurate identification of novel, functional noncoding (nc) RNA features in genome sequence has proven more difficult than for exons. Current algorithms identify and score potential RNA secondary structures on the basis of thermodynamic stability, conservation, and/or covariance in sequence alignments. Neither the algorithms nor the information gained from the individual inputs have been independently assessed. Furthermore, due to issues in modelling background signal, it has been difficult to gauge the precision of these algorithms on a genomic scale, in which even a seemingly small false-positive rate can result in a vast excess of false discoveries. Results: We developed a shuffling algorithm, shuffle-pair.pl, that simultaneously preserves dinucleotide frequency, gaps, and local conservation in pairwise sequence alignments. We used shuffle-pair.pl to assess precision and recall of six ncRNA search tools (MSARI, QRNA, ddbRNA, RNAz, EvoFold, and several variants of simple thermodynamic stability) on a test set of 3046 alignments of known ncRNAs. Relative to mononucleotide shuffling, preservation of dinucleotide content in shuffling the alignments resulted in a drastic increase in estimated false-positive detection rates for ncRNA elements, precluding evaluation of higher order alignments, which cannot be adequately shuffled while maintaining both dinucleotides and alignment structure. On pairwise alignments, none of the covariance-based tools performed markedly better than thermodynamic scoring alone. Although the high false-positive rates call into question the veracity of any individual predicted secondary structural element in our analysis, we nevertheless identified intriguing global trends in human genome alignments.
The distribution of ncRNA prediction scores in 75-base windows overlapping UTRs, introns, and intergenic regions analyzed using both thermodynamic stability and EvoFold (which has no thermodynamic component) was significantly higher for real than shuffled sequence, while the distribution for coding sequences was lower than that of corresponding shuffles. Conclusion: Accurate prediction of novel RNA structural elements in genome sequence remains a difficult problem, and development of an appropriate negative-control strategy for multiple alignments is an important practical challenge. Nonetheless, the general trends we observed for the distributions of predicted ncRNAs across genomic features are biologically meaningful, supporting the presence of secondary structural elements in many 3' UTRs, and providing evidence for evolutionary selection against secondary structures in coding regions.
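The distinction between mononucleotide and dinucleotide shuffling drawn above can be made concrete with a small sketch. It is not shuffle-pair.pl, which works on pairwise alignments and also preserves gaps and local conservation; it only shows the simpler mononucleotide baseline on a single sequence, together with a counter that reveals how that baseline changes the dinucleotide composition. The function names are hypothetical.

```python
import random
from collections import Counter


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides in a single sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def mononucleotide_shuffle(seq, rng):
    """Permute the letters of seq; base composition is kept, neighbor context is not."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    original = "ACGUACGUACGUACGU"
    shuffled = mononucleotide_shuffle(original, rng)
    print(Counter(original) == Counter(shuffled))  # base counts are preserved
    print(dinucleotide_counts(original))
    print(dinucleotide_counts(shuffled))  # generally differs from the original
```

Because stacking energies depend on neighboring bases, a control that alters dinucleotide content can make real sequences look more stable than their shuffles for reasons unrelated to selection, which is why the abstract treats dinucleotide preservation as the stricter test.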
- Research Article
119
- 10.1002/pmic.201100196
- Aug 31, 2011
- PROTEOMICS
Compared with the protein 3-class secondary structure (SS) prediction, the 8-class prediction gains less attention and is also much more challenging, especially for proteins with few sequence homologs. This paper presents a new probabilistic method for 8-class SS prediction using conditional neural fields (CNFs), a recently invented probabilistic graphical model. This CNF method not only models the complex relationship between sequence features and SS, but also exploits the interdependency among SS types of adjacent residues. In addition to sequence profiles, our method also makes use of non-evolutionary information for SS prediction. Tested on the CB513 and RS126 data sets, our method achieves Q8 accuracy of 64.9 and 64.7%, respectively, which are much better than the SSpro8 web server (51.0 and 48.0%, respectively). Our method can also be used to predict other structure properties (e.g. solvent accessibility) of a protein or the SS of RNA.
- Conference Article
29
- 10.1109/bibm.2010.5706547
- Dec 1, 2010
Compared to the protein 3-class secondary structure (SS) prediction, the 8-class prediction gains less attention and is also much more challenging, especially for proteins with few sequence homologs. This paper presents a new probabilistic method for 8-class SS prediction using Conditional Neural Fields (CNFs), a recently invented probabilistic graphical model. This CNF method not only models the complex relationship between sequence features and SS, but also exploits the interdependency among SS types of adjacent residues. In addition to sequence profiles, our method also makes use of non-evolutionary information for SS prediction. Tested on the CB513 and RS126 datasets, our method achieves Q8 accuracy of 64.9% and 64.7%, respectively, which are much better than the SSpro8 web server (51.0% and 48.0%, respectively). Our method can also be used to predict other structure properties (e.g., solvent accessibility) of a protein or the SS of RNA.
- Research Article
17
- 10.1186/s12859-021-04365-4
- Sep 20, 2021
- BMC Bioinformatics
Background: Studies have shown that non-coding RNAs (ncRNAs) of the same family have similar functions, so predicting the ncRNA family is helpful to research on ncRNA functions. The existing computational methods mainly fall into two categories: the first type predicts the ncRNA family by learning features of the sequence or secondary structure, and the second type predicts the ncRNA family by alignment among homologous sequences. In the first type, some methods predict the ncRNA family by learning predicted secondary structure features. The inaccuracy of the predicted secondary structure may cause the low accuracy of those methods. In contrast, ncRFP directly learns the features of ncRNA sequences to predict the ncRNA family. Although ncRFP simplifies the prediction process and improves the performance, there is room for improvement in ncRFP performance due to the incomplete features of its input data. In the second type, the homologous sequence alignment method can achieve the highest performance at present. However, because it needs consensus secondary structure annotation of ncRNA sequences and cannot model pseudoknots, the use of the method is limited. Results: In this paper, a novel method, ncDLRES, which learns sequence features, is proposed to predict the family of ncRNAs based on Dynamic LSTM (Long Short-term Memory) and ResNet (Residual Neural Network). Conclusions: ncDLRES extracts the features of ncRNA sequences based on Dynamic LSTM and then classifies them by ResNet. Compared with the homologous sequence alignment method, ncDLRES reduces the data requirement and expands the application scope. Compared with the first type of methods, the performance of ncDLRES is greatly improved.
- Abstract
- 10.1136/annrheumdis-2024-eular.6103
- Jun 1, 2024
- Annals of the Rheumatic Diseases
Background: Connective tissue diseases (CTDs) including systemic lupus erythematosus (SLE), primary Sjögren’s syndrome (pSS) and systemic sclerosis (SSc) frequently share clinical and serological features, rendering precise differentiation challenging. Currently used autoantibodies...
- Research Article
5
- 10.1007/978-1-4939-9045-0_22
- Jan 1, 2019
- Methods in molecular biology (Clifton, N.J.)
Two major components of posttranscriptional regulation are RNA-protein interactions and RNA secondary structure. While noncoding RNAs are far more abundant than messenger RNAs in eukaryotic systems, their functions remain largely unstudied. Evidence suggests that RNA-protein interactions and RNA secondary structure also regulate the function of long noncoding RNAs (lncRNAs), which are noncoding RNAs over 200 nucleotides (nt) in length. Protein interaction profile sequencing (PIP-seq) allows researchers to perform an unbiased screen of protein-bound regions and secondary structure of RNAs throughout a transcriptome of interest. Using a peak calling approach, our pipeline is able to identify protein-protected sites (PPSs), which are putative RNA-protein interaction sites. Additionally, by taking the ratio of read coverages in double-stranded RNA (dsRNA)-seq compared to single-stranded RNA (ssRNA)-seq libraries, our analysis can also calculate an RNA secondary structure score that reflects the likelihood of a region being comprised of double- or single-stranded ribonucleotides. Researchers can also use this pipeline to look at specific regions of interest, such as known lncRNAs, and determine their protein-bound status as well as elucidate their secondary structure.
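The structure score described above, a ratio of dsRNA-seq to ssRNA-seq read coverage, can be sketched as a single function. The abstract specifies only that a ratio is taken; the log2 transform, the pseudocount, and the function name `structure_score` are assumptions added here so that the score is symmetric around zero and defined when one library has no reads. The pipeline's actual normalization may differ.

```python
import math


def structure_score(ds_coverage, ss_coverage, pseudocount=1.0):
    """Log2 ratio of dsRNA-seq to ssRNA-seq coverage at one position or region.

    Positive values suggest a double-stranded (paired) region, negative values a
    single-stranded one. The pseudocount and log transform are illustrative choices.
    """
    if ds_coverage < 0 or ss_coverage < 0:
        raise ValueError("coverage values must be non-negative")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    return math.log2((ds_coverage + pseudocount) / (ss_coverage + pseudocount))


if __name__ == "__main__":
    # Hypothetical per-position coverages for a short region.
    ds = [30, 28, 3, 0, 25]
    ss = [5, 6, 20, 18, 4]
    print([round(structure_score(d, s), 2) for d, s in zip(ds, ss)])
```

In practice the two libraries would first need to be normalized for sequencing depth before their coverages are compared; that step is omitted here.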