Abstract
Background: Effective bioinformatics solutions are needed to tackle challenges posed by industrial-scale genome annotation. We present Bcheck, a wrapper tool which predicts RNase P RNA genes by combining the speed of pattern matching with the sensitivity of covariance models. The core of Bcheck is a library of subfamily-specific descriptor models and covariance models.

Results: Scanning all microbial genomes in GenBank identifies RNase P RNA genes in 98% of 1024 microbial chromosomal sequences within just 4 hours on a single CPU. Compared to the existing annotations found in 387 of the GenBank files, Bcheck predictions have more intact structure and are automatically classified by subfamily membership. For eukaryotic chromosomes, Bcheck identified the known RNase P RNA genes in 84 out of 85 metazoan genomes and 19 out of 21 fungal genomes. Bcheck predicted 37 novel eukaryotic RNase P RNA genes, 32 of which are from fungi. Gene duplication events are observed in at least 20 metazoan organisms. Scanning of metagenomic data from the Global Ocean Sampling Expedition, comprising over 10 million sample sequences (18 gigabases), predicted 2909 unique genes, 98% of which fall into the ancestral bacterial type A of RNase P RNA and 66% of which have no close homolog among known prokaryotic RNase P RNAs.

Conclusions: The combination of efficient filtering by means of a descriptor-based search and subsequent construction of a high-quality gene model by means of a covariance model provides an efficient method for the detection of RNase P RNA genes in large-scale sequencing data.

Bcheck is implemented as a web server and can also be downloaded for local use from http://rna.tbi.univie.ac.at/bcheck
Highlights
Effective bioinformatics solutions are needed to tackle challenges posed by industrial-scale genome annotation
We present Bcheck, a wrapper that performs efficient rnpB gene prediction by combining fast filtering with rnabob [17] and sensitive validation with Infernal. Constructing such a method entails two tasks: the design of an efficient yet sensitive descriptor model (DM) that acts as a filter, and the derivation of a sensitive statistical covariance model (CM)
The success of Bcheck depends on the efficiency and predictive power of both models, as well as a sensible wrapping algorithm that optimizes the interplay of DM and CM
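The interplay of a fast filter and a sensitive validator described above can be illustrated with a minimal sketch. This is not Bcheck's implementation: the function `filter_then_validate`, the regex-based `toy_filter` (standing in for a descriptor search such as rnabob), the pairing-based `toy_score` (standing in for a covariance-model score such as Infernal's) and the threshold are all hypothetical and chosen only to show the control flow.

```python
import re
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    start: int    # 0-based start of the candidate region
    end: int      # 0-based exclusive end of the candidate region
    score: float  # score assigned by the expensive validator


def filter_then_validate(
    sequence: str,
    cheap_filter: Callable[[str], Iterable[Tuple[int, int]]],
    expensive_score: Callable[[str], float],
    threshold: float,
) -> List[Hit]:
    """Run the expensive scorer only on regions proposed by the cheap filter."""
    hits = []
    for start, end in cheap_filter(sequence):
        score = expensive_score(sequence[start:end])
        if score >= threshold:
            hits.append(Hit(start, end, score))
    return hits


# Hypothetical stand-in for a descriptor model: a fixed stem-like pattern.
_TOY_PATTERN = re.compile(r"GGG[ACGT]{4}CCC")


def toy_filter(sequence: str) -> Iterable[Tuple[int, int]]:
    """Yield (start, end) for each non-overlapping pattern match."""
    for match in _TOY_PATTERN.finditer(sequence):
        yield match.start(), match.end()


_PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def toy_score(candidate: str) -> float:
    """Hypothetical stand-in for a covariance-model score.

    Counts complementary pairs between position i and its mirror position.
    """
    half = len(candidate) // 2
    paired = sum(
        (candidate[i], candidate[-1 - i]) in _PAIRS for i in range(half)
    )
    return float(paired)


example = "TTGGGAAAACCCTTTTGGGATATCCCAA"
candidates = list(toy_filter(example))
validated = filter_then_validate(example, toy_filter, toy_score, threshold=4.0)
```

In this toy input the filter proposes two candidate regions and the validator keeps only the one whose inner bases also pair, which mirrors the idea that the filter may be permissive as long as the second stage is selective.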
Summary
2.1 Algorithm and models
The construction of effective models of RNase P RNA genes is a non-trivial task because of the lack of strong family-specific conservation. To distinguish functional copies from pseudogenes in eukaryotes, we analyzed their promoter regions. For this purpose, we aligned 100 nt upstream of polymerase III transcripts of the same organism and compared the RNase P RNA gene predictions. After removing duplicate sequences from closely related strains, we obtained 777 unique rnpB genes, of which 45 belong to arcA, 10 to arcM, 621 to bacA, and 101 to bacB (see Table 3). The GenBank files contained annotated rnpB genes for 365 bacteria and 22 archaea, all of which were among the Bcheck predictions. Even though the promoter might be specific for an organism, it may differ from other polymerase III transcripts within one species. In each of these cases, a presumably functional RNase P RNA-like promoter structure was found for only one of the copies. The Bcheck pipeline can be downloaded from the same location for local use in a Linux environment.
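The promoter comparison relies on the 100 nt upstream of each transcript. The text does not say how these regions were extracted, so the following is only a sketch under stated assumptions: coordinates are 0-based and half-open on the forward strand, the strand is given as "+" or "-", and the helper names `reverse_complement` and `upstream_region` are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(
    genome: str, gene_start: int, gene_end: int, strand: str, length: int = 100
) -> str:
    """Return up to `length` nt upstream of a gene, read 5'->3' on the gene's strand.

    Assumes gene_start/gene_end are 0-based, half-open, forward-strand coordinates.
    The region is truncated if it would run past either end of the sequence.
    """
    if strand == "+":
        return genome[max(0, gene_start - length):gene_start]
    if strand == "-":
        # Upstream of a minus-strand gene lies after gene_end on the forward strand.
        return reverse_complement(genome[gene_end:gene_end + length])
    raise ValueError("strand must be '+' or '-'")


toy_genome = "ACGTTGCAAGGC"
plus_upstream = upstream_region(toy_genome, 4, 8, "+", length=3)
minus_upstream = upstream_region(toy_genome, 4, 8, "-", length=3)
```

Regions extracted this way for all polymerase III transcripts of one organism could then be passed to any multiple sequence aligner for the comparison described above.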