PlantRNA_Sniffer: A SVM-Based Workflow to Predict Long Intergenic Non-Coding RNAs in Plants.

Lucas Vieira,Sergio Lifschitz,Thais Carvalho,Flavia Thiebaut,Clicia Grativol,Adriana Hemerly,Maria Walter,Pablo Hardoim,Paulo Ferreira

doi:10.3390/ncrna3010011

Lucas Vieira, Sergio Lifschitz + Show 7 more

Open Access

https://doi.org/10.3390/ncrna3010011

Copy DOI

Abstract

Non-coding RNAs (ncRNAs) constitute an important set of transcripts produced in the cells of organisms. Among them, there is a large amount of a particular class of long ncRNAs that are difficult to predict, the so-called long intergenic ncRNAs (lincRNAs), which might play essential roles in gene regulation and other cellular processes. Despite the importance of these lincRNAs, there is still a lack of biological knowledge and, currently, the few computational methods considered are so specific that they cannot be successfully applied to other species different from those that they have been originally designed to. Prediction of lncRNAs have been performed with machine learning techniques. Particularly, for lincRNA prediction, supervised learning methods have been explored in recent literature. As far as we know, there are no methods nor workflows specially designed to predict lincRNAs in plants. In this context, this work proposes a workflow to predict lincRNAs on plants, considering a workflow that includes known bioinformatics tools together with machine learning techniques, here a support vector machine (SVM). We discuss two case studies that allowed to identify novel lincRNAs, in sugarcane (Saccharum spp.) and in maize (Zea mays). From the results, we also could identify differentially-expressed lincRNAs in sugarcane and maize plants submitted to pathogenic and beneficial microorganisms.

Highlights

The development of new techniques of high-throughput sequencing and the large amount of sequencing projects have been creating enormous volumes of biological data [1], which have been revealing an increasingly number of non-coding RNAs in eukaryotic genomes [2]
Between two distinct non-coding RNAs (ncRNAs) classifications we have (a) small RNAs, those well-known structured RNAs with lengths between 20 and 30 nucleotides, and (b) long ncRNAs, those presenting more than 200 nucleotides and a poor capacity to code proteins, which represent the least understood transcripts today [4,5,6]
No sugarcane genome is publicly available in biological databases, and no information about its lincRNAs is known

Summary

Introduction

The development of new techniques of high-throughput sequencing and the large amount of sequencing projects have been creating enormous volumes of biological data [1], which have been revealing an increasingly number of non-coding RNAs (ncRNAs) in eukaryotic genomes [2]. These ncRNAs directly act in the cellular structures, as well as in catalytic and regulatory processes [3].

Methods

Results

Conclusion