Abstract
We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3′ end of coding regions, as well as depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA.
Highlights
The number of non-coding RNAs in eukaryotic genomes is one of the pressing open questions of genomics
Our predictions may be associated with introns of unannotated protein-coding genes. 19 of our predictions scoring as small nucleolar RNAs (snoRNAs) correspond to the single RFAM family snoR28, and 17 of these appear in a tandem array on the X chromosome
As a first step towards functional characterization of protein-coding genes with predicted structurally-conserved elements in their 3′ and 5′ untranslated regions (UTRs) and introns, we identified enriched Gene Ontology (GO) terms with GO::TermFinder [42]
Summary
The number of non-coding RNAs (ncRNAs) in eukaryotic genomes is one of the pressing open questions of genomics. The program xrate allows the grammar structure to be specified in a configuration file; the parameters can be automatically estimated from training data, and the parameterized phylo-grammar can then be used to annotate new alignments. It implements a wide variety of models and can be used for measurement of evolutionary rates or for prediction of RNA (or protein) secondary structure. Using one of the grammars, we scan a multiple alignment of twelve Drosophila genomes for novel ncRNAs. As well as reproducing many of the predictions of earlier bioinformatics screens in Drosophila [11,13,28], our screen predicts numerous novel structured RNAs, lending support to the hypothesis that eukaryotic genomes are dense with ncRNAs. The simulation procedure that we use (which includes locally conserved regions that are not ncRNAs) suggests that false positive rates for ncRNA prediction are higher than previously reported. Our methods point the way to further evidence-based evaluations of whole-genome bioinformatics screens.