Prediction of plant lncRNA by ensemble machine learning classifiers

Caitlin M. A. Simopoulos, G. Brian Golding, Elizabeth A. Weretilnyk

doi:10.1186/s12864-018-4665-2

Open Access

https://doi.org/10.1186/s12864-018-4665-2

Journal: BMC Genomics
Publication Date: May 2, 2018
Citations: 52
License type: open-access

Affiliation: McMaster University

Abstract

Background

In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances on discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class in plants is largely understudied. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have been previously used to identify this enigmatic RNA class with applications largely focused on animal systems. Our approach uses a training set comprised only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation.

Results

Individual stochastic gradient boosting and random forest classifiers trained on only empirically validated long non-protein coding RNAs were constructed. In order to use the strengths of multiple classifiers, we combined multiple models into a single stacking meta-learner. This ensemble approach benefits from the diversity of several learners to effectively identify putative plant long non-coding RNAs from transcript sequence features. When the predicted genes identified by the ensemble classifier were compared to those listed in GreeNC, an established plant long non-coding RNA database, overlap for predicted genes from Arabidopsis thaliana, Oryza sativa and Eutrema salsugineum ranged from 51 to 83% with the highest agreement in Eutrema salsugineum. Most of the highest ranking predictions from Arabidopsis thaliana were annotated as potential natural antisense genes, pseudogenes, transposable elements, or simply computationally predicted hypothetical protein. Due to the nature of this tool, the model can be updated as new long non-protein coding transcripts are identified and functionally verified.

Conclusions

This ensemble classifier is an accurate tool that can be used to rank long non-protein coding RNA predictions for use in conjunction with gene expression studies. Selection of plant transcripts with a high potential for regulatory roles as long non-protein coding RNAs will advance research in the elucidation of long non-protein coding RNA function.
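
The stacking approach described in the Results can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, a synthetic feature table from `make_classification` standing in for real transcript sequence features, a logistic regression meta-learner, and base learners that differ only by random seed. All of these choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a table of transcript sequence features (assumption):
# rows are transcripts, y = 1 marks a validated lncRNA, y = 0 a coding transcript.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Eight gradient boosting base learners. subsample < 1.0 fits each tree on a
# random fraction of the training rows, which is what makes boosting "stochastic".
# Varying only the seed is an illustrative assumption.
base_learners = [
    (
        f"gbm_{i}",
        GradientBoostingClassifier(n_estimators=50, subsample=0.8, random_state=i),
    )
    for i in range(8)
]

# The meta-learner is trained on out-of-fold probabilities from the base learners.
# Logistic regression is an assumed choice of meta-learner.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

# Rank held-out transcripts by predicted probability of the positive class.
scores = stack.predict_proba(X_test)[:, 1]
ranking = scores.argsort()[::-1]  # indices of X_test, highest score first
```

Ranking by predicted probability rather than by hard class labels mirrors the stated goal of prioritising candidates for functional validation.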

Highlights

In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses.
We found the most successful method to be a stacking meta-learner constructed from eight stochastic gradient boosting models.
Feature selection (individual random forest and stochastic gradient boosting model construction): Researchers have proposed that specific characters in transcript sequences can be useful in long non-protein coding RNA (lncRNA) classification.
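
To make the idea of classifying on transcript sequence characters concrete, here is a small sketch of sequence-derived features. This excerpt does not list the features the authors used, so GC content, longest forward-strand open reading frame, and dinucleotide frequencies are illustrative assumptions, and every function name below is hypothetical.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_orf_length(seq):
    """Length in nucleotides (stop codon included) of the longest ATG-to-stop
    open reading frame in the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq, k=2):
    """Relative frequency of every A/C/G/T k-mer, using overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def sequence_features(seq):
    """Collect the illustrative features into one flat dictionary."""
    feats = {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "longest_orf": longest_orf_length(seq),
    }
    feats.update({f"freq_{kmer}": f for kmer, f in kmer_frequencies(seq).items()})
    return feats


example = sequence_features("CCATGAAATTTTAGGG")
```

Each transcript becomes one row of such features, which is the kind of table a random forest or gradient boosting classifier can be trained on.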

Summary

Introduction

Long non-protein coding RNAs (lncRNAs) are believed to have essential roles in development and stress responses. A recent review by Ma et al. [10] suggests that most known lncRNAs regulate transcription, both in cis and in trans, while others can affect translation, splicing, or post-translational regulation, or are classified under “other functional mechanisms.” Because of this wide range of functions, lncRNAs are typically classified by their position relative to protein-coding genes as intergenic (referred to as lincRNAs), natural antisense, or intronic [1, 10]. Current research suggests that these transcripts, which show minimal homology with close relatives [5], undergo fast and unclear evolution, making functional predictions challenging. This lack of distinct rules for predicting and identifying lncRNAs is a likely contributor to the lack of validated plant lncRNAs.

Methods

Results

Conclusion