Finding noncoding RNA transcripts from low abundance expressed sequence tags

Chenghai Xue, Fei Li

doi:10.1038/cr.2008.59

Abstract

It has been shown that noncoding RNA (ncRNA) genes are much more numerous than expected. However, identifying ncRNAs remains difficult with either computational algorithms or biological experiments. Recent reports have suggested that ncRNAs may also appear in expressed sequence tag (EST) databases. Nevertheless, intergenic ESTs have received little attention and are poorly annotated owing to their low abundance. Here, we have developed a computational strategy for discovering ncRNA genes from human ESTs. We first collected ESTs that are located in intergenic regions and lack detailed annotations. The intergenic regions were divided into non-overlapping 50-nt windows, and PhastCons scores obtained from the UCSC database were assigned to these windows. We kept as seeds the conserved windows that had PhastCons scores above 0.8 and at least three supporting ESTs. Each cluster of ESTs corresponding to a seed was assembled into a long contig. We used two criteria to screen for ncRNA transcripts among these contigs: first, the longest predicted open reading frame was shorter than 300 nt; second, likely Pol-II promoters existed within 2,000 nt upstream or downstream of the contig. As a result, 118 novel ncRNA genes were identified from human low-abundance ESTs. Of seven randomly selected candidates, six were transcribed in human 2BS cells, as shown by RT-PCR. Our work shows that ESTs are a 'hidden treasure' for detecting novel ncRNA genes.
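The seed-selection and open-reading-frame filters described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the use of a mean PhastCons score per window, the overlap-based count of supporting ESTs, and the ATG-to-stop definition of an ORF are all assumptions made for illustration; only the thresholds (50 nt, 0.8, three ESTs, 300 nt) come from the abstract. EST assembly and promoter prediction are not covered.

```python
from typing import List, Sequence, Tuple

# Thresholds taken from the abstract.
WINDOW = 50       # window size in nt
MIN_SCORE = 0.8   # PhastCons score must exceed this value
MIN_ESTS = 3      # minimum number of supporting ESTs
MAX_ORF = 300     # longest ORF must be shorter than this (nt)

STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seed_windows(
    scores: Sequence[float],
    ests: Sequence[Tuple[int, int]],
    window: int = WINDOW,
    min_score: float = MIN_SCORE,
    min_ests: int = MIN_ESTS,
) -> List[Tuple[int, int]]:
    """Return conserved, EST-supported windows of one intergenic region.

    `scores` holds one PhastCons score per base; `ests` holds 0-based,
    half-open (start, end) EST alignments relative to the region start.
    Assumption: a window's score is the mean of its per-base scores, and
    an EST supports a window if it overlaps it by at least one base.
    """
    seeds = []
    for start in range(0, len(scores) - window + 1, window):
        end = start + window
        mean_score = sum(scores[start:end]) / window
        support = sum(1 for s, e in ests if s < end and e > start)
        if mean_score > min_score and support >= min_ests:
            seeds.append((start, end))
    return seeds


def longest_orf(seq: str) -> int:
    """Length in nt of the longest ATG-to-stop ORF on either strand.

    Assumption: an ORF runs from the first in-frame ATG to the next
    in-frame stop codon (stop included); ORFs with no stop are ignored.
    """
    seq = seq.upper()
    best = 0
    for strand in (seq, seq.translate(_COMPLEMENT)[::-1]):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = i
                elif orf_start is not None and codon in STOPS:
                    best = max(best, i + 3 - orf_start)
                    orf_start = None
    return best


def passes_orf_filter(contig: str, max_orf: int = MAX_ORF) -> bool:
    """First screening criterion: longest predicted ORF below 300 nt."""
    return longest_orf(contig) < max_orf
```

In this toy setting, a 150-nt region whose first window has a mean score of 0.9 and three overlapping ESTs yields a single seed, and a contig such as `ATGAAATAA` (longest ORF of 9 nt) passes the ORF filter.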

Full Text