M6AMRFS: Robust Prediction of N6-Methyladenosine Sites With Sequence-Based Features in Multiple Species.

Xiaoli Qiang, Ran Su, Huangrong Chen, Leyi Wei, Xiucai Ye

doi:10.3389/fgene.2018.00495

Abstract

As a well-studied RNA methylation modification, N6-methyladenosine (m6A) plays important roles in various biological processes, such as RNA splicing and degradation. Identifying m6A sites is fundamentally important for better understanding their functional mechanisms. Recently, machine learning based prediction methods have emerged as an effective approach for fast and accurate identification of m6A sites. In this paper, we propose “M6AMRFS”, a new machine learning based predictor for the identification of m6A sites. In this predictor, we exploit a new feature representation algorithm that encodes RNA sequences with two feature descriptors (dinucleotide binary encoding and local position-specific dinucleotide frequency), and use the F-score algorithm combined with Sequential Forward Search (SFS) to enhance the feature representation ability. To predict m6A sites, we employ the eXtreme Gradient Boosting (XGBoost) algorithm to build a predictive model. Benchmarking results show that the proposed predictor is competitive with state-of-the-art predictors. Importantly, the robust predictions of our predictor across multiple species demonstrate that our predictive models have strong generalization ability. To the best of our knowledge, M6AMRFS is the first tool that can be used for the identification of m6A sites in multiple species. To facilitate its use, we have established a user-friendly webserver implementing M6AMRFS, which is currently available at http://server.malab.cn/M6AMRFS/. We anticipate that it will be a useful tool for research on m6A sites.
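The feature selection step named in the abstract (F-score ranking followed by a sequential forward search) can be illustrated with a minimal sketch. It assumes the commonly used F-score definition (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances) and a simple search that grows the ranked feature list one feature at a time. The function names and the `evaluate` callback are hypothetical and are not taken from the M6AMRFS implementation; in practice `evaluate` might return the cross-validated accuracy of a classifier such as XGBoost trained on the chosen columns.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the positive and negative samples.

    Assumes at least two samples per class (statistics.variance requires it).
    """
    all_mean = mean(pos_values + neg_values)
    numerator = (mean(pos_values) - all_mean) ** 2 + (mean(neg_values) - all_mean) ** 2
    # Sum of the within-class sample variances (n - 1 in the denominator).
    denominator = variance(pos_values) + variance(neg_values)
    if denominator == 0:
        # No spread inside either class: treat a mean difference as perfectly discriminative.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def rank_features(X, y):
    """Return feature indices sorted by decreasing F-score.

    X is a list of equal-length feature rows; y holds 0/1 labels.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def sequential_forward_search(ranked, evaluate):
    """Grow the ranked feature list one feature at a time; keep the best-scoring prefix.

    `evaluate` is a hypothetical callback mapping a list of feature indices to a score.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]]
y = [0, 0, 1, 1]
ranked = rank_features(X, y)
# Stand-in scorer for illustration only; a real one would train and validate a model.
subset, score = sequential_forward_search(ranked, lambda s: 1.0 if s == [0] else 0.5)
```

Ranking first and then searching only over prefixes of the ranked list keeps the number of model evaluations linear in the number of features, which is one plausible reason to pair the two steps.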

Highlights

To date, more than 150 types of RNA modifications have been discovered (Maden, 1990; Wang X. et al., 2014).
Chen et al. (2015a) developed the first machine learning based predictor, called “iRNA-Methyl”, for m6A site identification. They exploited physicochemical properties and sequence-order information embedded in PseDNC (Liu B. et al., 2015), and used a support vector machine for model construction.
For feature extraction (Mrozek et al., 2007, 2013), we proposed a feature representation algorithm to encode sequences with dinucleotide binary encoding and local position-specific dinucleotide frequency.
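To make the two descriptors concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes that each of the 16 dinucleotides is mapped to a 4-bit code in lexicographic order (AA = 0000 through UU = 1111), and that the local position-specific dinucleotide frequency at position i is the number of occurrences of the dinucleotide ending at i within the prefix of the sequence up to i, divided by the prefix length. The mapping order and the function names are illustrative choices.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# Assumed mapping: dinucleotides in lexicographic order receive the codes 0000..1111.
DINUCLEOTIDES = ["".join(pair) for pair in product(NUCLEOTIDES, repeat=2)]
BINARY_CODE = {
    d: [int(bit) for bit in format(i, "04b")] for i, d in enumerate(DINUCLEOTIDES)
}


def normalize(seq):
    """Upper-case the sequence and treat DNA-style T as U."""
    return seq.upper().replace("T", "U")


def dinucleotide_binary_encoding(seq):
    """Concatenate the 4-bit code of every overlapping dinucleotide in the sequence."""
    seq = normalize(seq)
    features = []
    for i in range(len(seq) - 1):
        # Raises KeyError on symbols outside A, C, G, U (e.g. N); handle upstream if needed.
        features.extend(BINARY_CODE[seq[i:i + 2]])
    return features


def local_dinucleotide_frequency(seq):
    """For each position i >= 1, frequency of seq[i-1:i+1] within the prefix seq[:i+1]."""
    seq = normalize(seq)
    features = []
    for i in range(1, len(seq)):
        prefix = seq[:i + 1]
        target = seq[i - 1:i + 1]
        # Count overlapping occurrences of the target dinucleotide in the prefix.
        count = sum(1 for j in range(len(prefix) - 1) if prefix[j:j + 2] == target)
        features.append(count / len(prefix))
    return features


def encode(seq):
    """Concatenate both descriptors into one feature vector of length 5 * (len(seq) - 1)."""
    return dinucleotide_binary_encoding(seq) + local_dinucleotide_frequency(seq)
```

For the toy input `"AAC"`, the binary part is `[0, 0, 0, 0, 0, 0, 0, 1]` (AA followed by AC) and the frequency part is `[1/2, 1/3]`. Real inputs would be fixed-length windows centered on the candidate adenosine, so that every sample yields a vector of the same length.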

Summary

INTRODUCTION

More than 150 types of RNA modifications have been discovered (Maden, 1990; Wang X. et al., 2014). Chen et al. (2015a) developed the first machine learning based predictor, called “iRNA-Methyl”, for m6A site identification. They exploited physicochemical properties and sequence-order information embedded in PseDNC (pseudo dinucleotide composition) (Liu B. et al., 2015), and used a support vector machine for model construction. To improve the predictive performance, Jia et al. proposed fusing three types of feature descriptors: bi-profile Bayes, dinucleotide composition, and KNN scores. Their results showed that this fusion strategy achieves better performance than any single feature descriptor (Jia et al., 2016). We anticipate that M6AMRFS will be a useful complement to existing tools and will help to further reveal the functional mechanisms of m6A sites.

MATERIALS AND METHODS

RESULTS AND DISCUSSION

Methods

CONCLUSION