Identification of long non-coding transcripts with feature selection: a comparative study

Giovanna M M Ventola, Michele Ceccarelli, Antonietta Spagnuolo, Salvatore D’Aniello, Luigi Cerulo, Teresa M R Noviello

doi:10.1186/s12859-017-1594-z

Abstract

Background: The unveiling of long non-coding RNAs as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs from transcripts assembled with high-throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish between coding and non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used with success in the literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs.

Results: In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature, and we include, as a novel set of features, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test for signature stability, and evaluate the prediction ability of a signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish, highlighting that some features are shared among species, while others tend to be species-specific. Compared to coding potential tools and similar supervised approaches, including novel signatures, such as those identified here, in a machine learning algorithm improves the prediction performance, in terms of area under the precision-recall curve, by 1 to 24%, depending on the species and on the signature.

Conclusions: Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines, which are especially relevant for poorly annotated genomes, such as zebrafish.
We provide a web tool that uses the obtained signatures to recognize novel long non-coding RNAs from input in FASTA and GTF formats. The tool is available at the following URL: http://www.bioinformatics-sannio.org/software/.

Highlights

The unveiling of long non-coding RNAs as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs from transcripts assembled with high-throughput RNA-seq data.
The most relevant tools in this category are: IseeRNA, which is limited to the subclass of long intergenic ncRNAs (lincRNAs) and is based on a Support Vector Machine classifier trained with conservation score, open reading frame length, and di-/tri-nucleotide sequence frequencies [7]; PLEK, which uses a Support Vector Machine trained with an improved k-mer scheme to distinguish long non-coding RNAs (lncRNAs) from messenger RNAs in the absence of genomic sequences or annotations [8]; lncRNA-MFDL, which uses a deep learning algorithm with multiple features of the open reading frame, k-mer, secondary structure, and the most-like coding domain sequence [9]; and Lv et al., who use LASSO regularization trained with genomic and chromatin features [10].
Some of the features show obvious associations, such as: transcript length (TxLen) and open reading frame (ORF) length (OrfLen), conservation scores computed with alternative tools (PhyloP and PhastCons), and di-/tri-nucleotides encoding similar information (TT vs TTT, GG vs GGG, CC vs CCC, AA vs AAA, GC vs GCC, TA vs ATA/TAT, GA vs AGA).
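The kind of association noted between a di-nucleotide and a tri-nucleotide that contains it can be illustrated with a small sketch. This is not the analysis performed in the paper: the sequences below are synthetic (random nucleotides with a varying share of T), and the helper names `kmer_frequency`, `pearson`, and `random_transcript` are hypothetical. The sketch only shows how the frequency of TT and the frequency of TTT can be compared across a set of sequences with a Pearson correlation.

```python
import math
import random


def kmer_frequency(seq, kmer):
    """Fraction of overlapping windows of len(kmer) in seq that equal kmer."""
    k = len(kmer)
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if seq[i:i + k] == kmer)
    return hits / windows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def random_transcript(rng, length, t_weight):
    """Synthetic sequence; t_weight skews the share of T (toy data only)."""
    return "".join(rng.choices("ACGT", weights=[1, 1, 1, t_weight], k=length))


# Assumption: 200 synthetic sequences of 500 nt stand in for real transcripts.
rng = random.Random(0)
transcripts = [random_transcript(rng, 500, rng.uniform(0.5, 3.0))
               for _ in range(200)]

tt = [kmer_frequency(s, "TT") for s in transcripts]
ttt = [kmer_frequency(s, "TTT") for s in transcripts]
r_tt_ttt = pearson(tt, ttt)
print(f"Pearson r(TT, TTT) = {r_tt_ttt:.2f}")
```

On real transcripts the same comparison could be run over every pair of feature columns; pairs with a high absolute correlation are candidates for carrying redundant information.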

Summary

Introduction

The unveiling of long non-coding RNAs (lncRNAs) as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel lncRNAs from transcripts assembled with high-throughput RNA-seq data. The classifier is used to predict new potential lncRNAs. The most relevant tools in this category are: IseeRNA, which is limited to the subclass of long intergenic ncRNAs (lincRNAs) and is based on a Support Vector Machine classifier trained with conservation score, open reading frame length, and di-/tri-nucleotide sequence frequencies [7]; PLEK, which uses a Support Vector Machine trained with an improved k-mer scheme to distinguish lncRNAs from messenger RNAs (mRNAs) in the absence of genomic sequences or annotations [8]; lncRNA-MFDL, which uses a deep learning algorithm with multiple features of the open reading frame, k-mer, secondary structure, and the most-like coding domain sequence [9]; and Lv et al., who use LASSO regularization trained with genomic and chromatin features [10].
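Several of the sequence-based features mentioned above (transcript length, open reading frame length, and di-/tri-nucleotide frequencies) can be computed directly from a transcript sequence. The following is a minimal sketch under stated assumptions, not the feature extraction used by any of the cited tools: an ORF is taken here to run from an ATG to the first in-frame stop codon on the forward strand only, and the function names `longest_orf_length`, `kmer_frequencies`, and `extract_features` are hypothetical.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in nt of the longest ATG..stop ORF over the three forward frames.

    Assumption: the stop codon is counted, and ORFs without an in-frame
    stop codon are ignored.
    """
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq, k):
    """Relative frequency of every A/C/G/T k-mer over overlapping windows."""
    windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(max(windows, 0)))
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / windows if windows > 0 else 0.0)
            for kmer in kmers}


def extract_features(seq):
    """Feature dictionary: TxLen, OrfLen, 16 di- and 64 tri-nucleotide frequencies."""
    seq = seq.upper().replace("U", "T")
    features = {"TxLen": len(seq), "OrfLen": longest_orf_length(seq)}
    features.update(kmer_frequencies(seq, 2))
    features.update(kmer_frequencies(seq, 3))
    return features


# Toy transcript (not real data): ATG AAA TTT TAG flanked by two bases each side.
example = extract_features("ccATGAAATTTTAGgg")
print(example["TxLen"], example["OrfLen"], round(example["TT"], 3))
```

A dictionary of this kind, computed for each annotated coding and non-coding transcript, is the sort of input a supervised classifier or a feature selection algorithm would take; conservation scores and secondary structure features would need external tools and are left out of the sketch.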

Methods

Results

Conclusion