Abstract

Long non-coding RNAs (lncRNAs) are a class of RNAs longer than 200 nucleotides (nt) that do not encode proteins; nevertheless, they have many vital biological functions. The development of high-throughput sequencing technology has led to the discovery of a large number of novel transcripts. As a result, computational methods for lncRNA prediction are in great demand. In this paper, we consider global sequence features and propose a stacked ensemble learning-based method, abbreviated as PredLnc-GFStack, to predict lncRNAs from transcripts. We extract the critical features from the candidate feature list using a genetic algorithm (GA) and then employ stacked ensemble learning to construct the PredLnc-GFStack model. Computational experiments show that PredLnc-GFStack outperforms several state-of-the-art methods for lncRNA prediction. Furthermore, PredLnc-GFStack demonstrates an outstanding ability for cross-species ncRNA prediction.

Highlights

  • In the last two decades, a massive amount of novel transcript data was discovered due to the development of high-throughput sequencing techniques [1]

  • Many tools related to long non-coding RNA (lncRNA) activity have been developed; for example, LADP [15] predicts lncRNA-disease associations, and LPLNP [16] and SFPEL-LPI [17] predict lncRNA-protein interactions

  • The average sizes of the optimal feature subsets for human and mouse are 133 and 121, respectively. These results indicate that the genetic algorithm combined with random forest (GA-RF) performs well for lncRNA prediction on both human and mouse datasets, i.e., the performance of GA-RF is not affected by the species difference between human and mouse
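
To make the idea of GA-driven feature subset selection concrete, the following is a minimal sketch and not the authors' implementation. Each candidate subset is encoded as a bit mask, and a genetic algorithm with elitism, one-point crossover and bit-flip mutation searches for a mask with high validation accuracy and few features. Several things here are assumptions made only for illustration: a dependency-free nearest-centroid classifier stands in for the random forest used in GA-RF, the data are synthetic (two informative features and six noise features), and every name and parameter value (`ga_select`, `fitness`, `penalty`, `p_mut`, and so on) is hypothetical.

```python
import random

def nearest_centroid_accuracy(train, valid, mask):
    """Validation accuracy of a nearest-centroid classifier that only sees
    the features whose mask bit is 1 (a stand-in for the random forest)."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:                       # an empty subset cannot classify anything
        return 0.0
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[i] for r in rows) / len(rows) for i in idx]
    correct = 0
    for x, y in valid:
        dist = {c: sum((x[i] - m) ** 2 for i, m in zip(idx, cen))
                for c, cen in centroids.items()}
        correct += min(dist, key=dist.get) == y
    return correct / len(valid)

def fitness(mask, train, valid, penalty=0.01):
    """Accuracy minus a small (assumed) penalty per selected feature."""
    return nearest_centroid_accuracy(train, valid, mask) - penalty * sum(mask)

def ga_select(train, valid, n_features, pop_size=20, generations=30,
              p_mut=0.1, seed=0):
    rng = random.Random(seed)
    # Seed the population with the all-features mask plus random masks.
    pop = [[1] * n_features] + [
        [rng.randint(0, 1) for _ in range(n_features)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, train, valid), reverse=True)
        elite = pop[: pop_size // 2]              # elitism: keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda m: fitness(m, train, valid))

def make_data(n, rng):
    """Synthetic data: features 0-1 are informative, features 2-7 are noise."""
    data = []
    for k in range(n):
        y = k % 2
        x = [rng.gauss(2.0 * y, 1.0) for _ in range(2)]
        x += [rng.gauss(0.0, 1.0) for _ in range(6)]
        data.append((x, y))
    return data

rng = random.Random(42)
train, valid = make_data(200, rng), make_data(100, rng)
best = ga_select(train, valid, n_features=8)
print("selected mask:", best, "fitness:", round(fitness(best, train, valid), 3))
```

Because the best half of each generation is carried over unchanged and the initial population contains the all-features mask, the returned subset never scores worse than using every feature under this fitness function. In a realistic setting the fitness would more likely be cross-validated performance of the actual classifier.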

Summary

Introduction

In the last two decades, a massive amount of novel transcript data has been discovered thanks to the development of high-throughput sequencing techniques [1]. Wei et al. [19] proposed a support vector machine (SVM)-based method, abbreviated as CPC, which assesses the protein-coding potential of a transcript based on multiple sequence features. One problem is that different models often employ different features in practice, without testing other features of lncRNAs. In addition, most lncRNA prediction methods select features subjectively rather than using an effective feature selection method to evaluate various feature combinations, and they rely on a single classifier (e.g., SVM or random forest (RF)), which leaves room for improvement in generalization performance. We consider global sequence features and propose a stacked ensemble learning-based method, abbreviated as PredLnc-GFStack, to differentiate long non-coding RNAs from coding RNAs.
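
As an illustration of why stacking differs from using a single classifier, here is a minimal, dependency-free sketch of stacked ensemble learning. It should not be read as the PredLnc-GFStack architecture: this summary does not specify the base learners or the meta-learner, so the two toy base learners (`fit_centroid`, `fit_stump`), the logistic-regression meta-learner, the five-fold scheme, and the synthetic data are all assumptions chosen for brevity. The essential point is that the meta-learner is trained on out-of-fold predictions of the base learners, so it never sees a base prediction made on data that the base learner was fitted on.

```python
import math
import random

def fit_centroid(rows):
    """Base learner 1: score in [0, 1], near 1 when x is closer to the
    class-1 centroid than to the class-0 centroid."""
    cents = {}
    for label in (0, 1):
        xs = [x for x, y in rows if y == label]
        cents[label] = [sum(col) / len(xs) for col in zip(*xs)]
    def score(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, cents[0]))
        d1 = sum((a - b) ** 2 for a, b in zip(x, cents[1]))
        return d0 / (d0 + d1 + 1e-12)
    return score

def fit_stump(rows):
    """Base learner 2: threshold on feature 0 at the midpoint of class means."""
    m0 = sum(x[0] for x, y in rows if y == 0) / sum(1 for _, y in rows if y == 0)
    m1 = sum(x[0] for x, y in rows if y == 1) / sum(1 for _, y in rows if y == 1)
    thr = (m0 + m1) / 2
    return lambda x: 1.0 if (x[0] > thr) == (m1 > m0) else 0.0

def fit_logistic(Z, y, lr=0.5, epochs=200):
    """Meta-learner: logistic regression trained by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            p = 1.0 / (1.0 + math.exp(-(b + sum(wi * zi for wi, zi in zip(w, z)))))
            gw = [g + (p - t) * zi for g, zi in zip(gw, z)]
            gb += p - t
        w = [wi - lr * g / len(Z) for wi, g in zip(w, gw)]
        b -= lr * gb / len(Z)
    return lambda z: 1.0 / (1.0 + math.exp(-(b + sum(wi * zi for wi, zi in zip(w, z)))))

def meta_features(models, x):
    """Base-learner scores, centered at 0 so the meta-learner starts balanced."""
    return [m(x) - 0.5 for m in models]

def fit_stack(rows, base_fitters, k=5):
    Z = [None] * len(rows)
    for fold in range(k):
        # Fit base learners without this fold, then predict on the fold only.
        train = [r for i, r in enumerate(rows) if i % k != fold]
        models = [fit(train) for fit in base_fitters]
        for i, (x, _) in enumerate(rows):
            if i % k == fold:
                Z[i] = meta_features(models, x)
    meta = fit_logistic(Z, [y for _, y in rows])
    full = [fit(rows) for fit in base_fitters]    # refit on all data for use
    return lambda x: int(meta(meta_features(full, x)) >= 0.5)

def make_rows(n, rng):
    """Synthetic data: features 0-1 informative, features 2-3 noise."""
    rows = []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(2.0 * y, 1.0), rng.gauss(2.0 * y, 1.0),
             rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        rows.append((x, y))
    return rows

rng = random.Random(7)
train_rows, test_rows = make_rows(200, rng), make_rows(100, rng)
predict = fit_stack(train_rows, [fit_centroid, fit_stump])
accuracy = sum(predict(x) == y for x, y in test_rows) / len(test_rows)
print("stacked accuracy:", accuracy)
```

In practice the base learners would be stronger and more diverse models; the toy learners here only keep the sketch short and self-contained.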

Datasets
Features Extraction
Stacked
Evaluation
Evaluation of the Optimal Feature Subsets
Evaluation of PredLnc-GFStack on Different Datasets
Comparison with Other Methods
Conclusions
