Abstract

As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are expensive and time-consuming. Hence, an efficient computational method that can accurately identify m7G sites is needed. In this study, we propose a novel method, BERT-m7G, that uses a BERT-based multilingual model to represent the information in RNA sequences. First, we treat RNA sequences as natural-language sentences and employ the bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Second, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important ones. Finally, the selected feature subset is fed into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with the tree-structured Parzen estimator (TPE) approach. Under 10-fold cross-validation, BERT-m7G achieves an accuracy (ACC) of 95.48% and a Matthews correlation coefficient (MCC) of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.

Highlights

  • RNA posttranscriptional modification (PTM) is a common phenomenon in biological processes [1]

  • We select locally linear embedding (LLE) [46], spectral embedding (SE) [47], XGBoost [48], light gradient boosting machine (LightGBM) [49], principal component analysis (PCA) [50], Boruta [51], singular value decomposition (SVD) [50], and elastic net (EN) methods to reduce both the dimensionality of the initial feature space and the difficulty of the learning task

  • When using EN as the feature selection method, we first assign each feature an importance score based on its coefficient in the regularized EN model, then sort the features in descending order of importance score, and finally remove the features whose importance score is zero
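
The EN-based selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_features_by_en_coefficients`, the `tol` parameter, the example coefficients, and the use of the absolute coefficient value as the importance score are all assumptions made for this sketch. The coefficients are taken as given; in practice they would come from an already fitted elastic net model.

```python
from typing import List, Sequence, Tuple


def select_features_by_en_coefficients(
    coefficients: Sequence[float], tol: float = 0.0
) -> List[Tuple[int, float]]:
    """Rank features by elastic net coefficient and drop zero-score features.

    Returns (feature_index, importance_score) pairs, highest score first.
    """
    # Assumption: the importance score is the absolute value of the coefficient.
    scored = [(idx, abs(coef)) for idx, coef in enumerate(coefficients)]
    # Sort in descending order of importance; Python's sort is stable, so
    # features with equal scores keep their original column order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Remove features whose score is zero (hypothetical tolerance `tol`).
    return [(idx, score) for idx, score in scored if score > tol]


# Hypothetical coefficients for six features, for illustration only.
example_coefficients = [0.0, -0.8, 0.25, 0.0, 1.5, -0.25]
ranked = select_features_by_en_coefficients(example_coefficients)
selected_indices = [idx for idx, _ in ranked]
print(selected_indices)
```

The returned indices can then be used to slice the columns of the feature matrix before it is passed to a downstream classifier. Features that the regularization shrinks to exactly zero are dropped, which is what makes EN usable as a selector rather than only as a regressor.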



Introduction

RNA posttranscriptional modification (PTM) is a common phenomenon in biological processes [1]. N7-methylguanosine (m7G) is a positively charged RNA modification produced by the addition of a methyl group at position N7 of riboguanosine [2, 3]. Its expression level is regulated by methyltransferases [4, 5]. Studies have shown that m7G plays a critical role in almost every stage of the mRNA life cycle, including mRNA splicing, nuclear export, mRNA stability, translation, and transcription [6,7,8,9,10,11]. Given the importance and distinctive nature of N7-methylguanosine, accurately determining the distribution of m7G in the transcriptome is the basis for an in-depth understanding of its biological functions and modification mechanisms.

