Abstract

Because long non-coding RNAs (lncRNAs) are involved in a wide range of cellular and developmental processes, an increasing number of methods have been proposed for distinguishing lncRNAs from coding RNAs. However, most existing methods are designed for lncRNAs in animal systems, and only a few focus on plant lncRNA identification. Plant lncRNAs have characteristics distinct from those of animal lncRNAs, so a computational method for accurate and robust identification of plant lncRNAs is desirable. Herein, we present ItLnc-BXE, a plant lncRNA identification method that utilizes comprehensive features and an ensemble learning strategy. First, a diverse set of sequence features is collected and filtered by feature selection to represent transcripts. Then, several base learners are trained and combined into a single meta-learner by ensemble learning, yielding an ItLnc-BXE model. ItLnc-BXE models are evaluated on datasets of six plant species; the results show that ItLnc-BXE outperforms other state-of-the-art plant lncRNA identification methods, achieving better and more robust performance (AUC>95.91%). We also perform cross-species lncRNA identification experiments, and the results indicate that dicot-based and monocot-based models can be used to accurately identify lncRNAs in lower plant species, such as mosses and algae. Source code and supplementary data are available at https://github.com/BioMedicalBigDataMiningLab/ItLnc-BXE.
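
To make the idea of representing a transcript by sequence features concrete, the following is a minimal sketch of two widely used alignment-free features: normalized k-mer frequencies and the longest forward-strand ORF length. It is an illustration only and is not the ItLnc-BXE feature set; the function names `kmer_frequencies`, `longest_orf_length` and `sequence_features` are hypothetical and do not come from the ItLnc-BXE repository.

```python
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG-to-stop ORF in the three forward frames."""
    seq = seq.upper().replace("U", "T")
    stops = {"TAA", "TAG", "TGA"}
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in stops:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def sequence_features(seq, k=3):
    """Concatenate k-mer frequencies with ORF coverage (assumed, illustrative feature vector)."""
    orf_coverage = longest_orf_length(seq) / max(len(seq), 1)
    return kmer_frequencies(seq, k) + [orf_coverage]
```

A vector produced by `sequence_features` could then be passed to any classifier; the feature selection and ensemble steps described in the abstract are not reproduced here.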

Highlights

  • The recent improvements in high-throughput sequencing and the application of machine learning methods have led to the identification of numerous novel gene sequences [1]–[5]

  • Experimental setting: We evaluate all ItLnc-BXE models on the datasets of six species: Arabidopsis thaliana (A), Solanum tuberosum (S), Oryza sativa (O), Hordeum vulgare (H), Physcomitrella patens (P) and Chlamydomonas reinhardtii (C)

  • Feature discussion: Features are critical for distinguishing long non-coding RNAs (lncRNAs) from protein-coding transcripts (pcts), so we consider a variety of features for plant lncRNA identification


Summary

Introduction

Recent improvements in high-throughput sequencing and the application of machine learning methods have led to the identification of numerous novel gene sequences [1]–[5]. Since only a few lncRNAs have been annotated, many machine learning-based methods have been proposed for lncRNA identification, such as CPC2 [10], CPAT [11] and PLEK [12]. CPC2 employed an SVM model with an RBF kernel to distinguish coding RNAs from non-coding RNAs. CPAT used logistic regression (LR) for novel lncRNA identification. PLEK applied a computational pipeline based on an improved k-mer scheme and an SVM algorithm. These methods were all alignment-free, meaning that they only made use of features derived directly from sequences. CPC2 constructed a feature set composed of four intrinsic features, which were
