Abstract
Background: ncRNAs (noncoding RNAs) play important roles in many biological processes. Existing genome-scale ncRNA search tools identify ncRNAs in local sequence alignments generated by conventional sequence comparison methods. However, some types of ncRNA lack strong sequence conservation and tend to be missed or mis-aligned by conventional sequence comparison.
Results: In this paper, we propose an ncRNA identification framework that is complementary to existing sequence comparison tools. By integrating a filtration step based on Hamming distance and ncRNA alignment programs such as FOLDALIGN or PLAST-ncRNA, the proposed ncRNA search framework can identify ncRNAs that lack strong sequence conservation. In addition, as the ratio of transition to transversion mutations is often used as a discriminative feature for functional ncRNA identification, we incorporate this feature into the filtration step using a coding strategy. We apply Hamming distance seeds to ncRNA search in the intergenic regions of human and mouse genomes and between the Burkholderia cenocepacia J2315 genome and the Ralstonia solanacearum genome. The experimental results demonstrate that a carefully designed Hamming distance seed can achieve better sensitivity in searching for poorly conserved ncRNAs than conventional sequence comparison tools.
Conclusions: Hamming distance seeds provide better sensitivity as a filtration strategy for genome-wide ncRNA homology search than the existing seeding strategies used in BLAST-like tools. By combining Hamming distance seed matching and ncRNA alignment, we are able to find ncRNAs with sequence similarities below 60%.
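The filtration idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the brute-force window scan, and the mismatch weights (1 for a transition, 2 for a transversion) are all assumptions chosen for illustration, and the weighted count merely stands in for the paper's coding strategy, which the abstract does not specify. Sequences are assumed to be uppercase DNA over A, C, G, T.

```python
# Illustrative sketch only: names, weights and the brute-force scan are
# assumptions, not the method described in the paper.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def mismatch_cost(x, y, transversion_cost=2):
    """Return 0 for a match, 1 for a transition, transversion_cost otherwise."""
    if x == y:
        return 0
    same_class = (x in PURINES and y in PURINES) or (
        x in PYRIMIDINES and y in PYRIMIDINES
    )
    return 1 if same_class else transversion_cost


def seed_distance(a, b, transversion_cost=2):
    """Weighted mismatch count between two equal-length windows.

    With transversion_cost=1 this is the plain Hamming distance.
    """
    if len(a) != len(b):
        raise ValueError("seed windows must have equal length")
    return sum(mismatch_cost(x, y, transversion_cost) for x, y in zip(a, b))


def hd_seed_hits(query, target, k, max_dist, transversion_cost=2):
    """Return all (i, j) whose length-k windows are within max_dist."""
    hits = []
    for i in range(len(query) - k + 1):
        for j in range(len(target) - k + 1):
            d = seed_distance(
                query[i:i + k], target[j:j + k], transversion_cost
            )
            if d <= max_dist:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Window pairs that pass the filter would then go to an ncRNA aligner.
    print(hd_seed_hits("ACGT", "TTGCGTAA", k=4, max_dist=1))
```

Unlike an exact-match seed, a window pair passes this filter even when it contains mismatches, which is what lets poorly conserved regions reach the alignment stage. The quadratic scan is only for clarity; a genome-scale search would need an indexed lookup.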
Highlights
Identifying ncRNAs, which function directly as RNAs rather than being translated into proteins, has drawn tremendous attention recently for two main reasons
As we are only interested in ncRNA homologs with low sequence similarity, we examine the PLAST-ncRNA probabilities for tRNA and SECIS homologs between human and mouse, because these two have low sequence conservation
Our experimental results show that HD seed matching provides an effective and efficient filtration step for genome-scale ncRNA search
Summary
Identifying ncRNAs (non-coding RNAs), which function directly as RNAs rather than being translated into proteins, has drawn tremendous attention recently for two main reasons. Existing genome-scale ncRNA identification methods [2,3,4] first employ conventional sequence comparison tools such as BLAST [5] to generate an initial set of alignments for further screening. That screening draws on features such as secondary structure conservation, minimum free energy (MFE), sequence conservation, GC content, and base or base-pair substitution patterns. Although BLAST-like sequence comparison tools have been successfully used for finding protein-coding genes, segment duplications, and other genomic features, they are not well suited for comprehensive ncRNA search: some types of ncRNA lack strong sequence conservation and tend to be missed or mis-aligned by conventional sequence comparison.