Abstract

Knowledge of the exact transcriptional start site of a gene is crucial for assigning and studying its promoter region and, sometimes, for obtaining the entire protein-coding region. Correct assignment of the transcriptional start site is not an easy task. Most cDNAs currently deposited in databases do not contain the sequences from the 5′ ends of genes, largely because of the difficulty in obtaining full-length clones. Even though sequencing of the human genome has been completed, the number of entries in the eukaryotic promoter database is only in the hundreds.

In a recent report, Suzuki et al. [1] describe a database containing the transcriptional start sites of thousands of genes along with their location on the human draft genome sequence. They created this database by generating and sequencing cDNA libraries that are specifically derived from mRNA species containing a 5′ cap structure (oligo-capping method). This procedure enriches for full-length clones. Over 111 000 clones derived from 132 libraries were used to extend the 5′ ends of these genes by an average of 87 base pairs. They obtained clones corresponding to 7889 human genes in the reference sequence database (RefSeq) project initiated by the National Center for Biotechnology Information (NCBI) and were able to further extend the 5′ ends of 59% of these. Strikingly, the authors found that the transcriptional start sites for >1000 of these genes were located >10 kilobases upstream of the currently assigned sites owing to the presence of large introns.

This database (http://elmo.ims.u-tokyo.ac.jp/dbtss) can be searched using a gene symbol, gene definition, RefSeq ID, LocusLink ID or UniGene ID. Alternatively, one can perform a BLAST search against the database to see whether a longer version of a cDNA exists. The search results are presented in an organized format: the top part of the output shows the genomic context of the gene (exon–intron structure) with the transcriptional start site(s) clearly marked, whereas the bottom part depicts the original RefSeq sequence along with any additional 5′ sequence. This database, which will be updated periodically, will be useful for anyone who wishes to characterize and study human genes in detail. One would hope that the additional sequence data generated by Suzuki et al. would ultimately be incorporated into RefSeq and other databases for wider dissemination of the information.

1. Suzuki, Y. et al. DBTSS: database of human transcriptional start sites and full-length cDNAs. Nucleic Acids Res. 2002; 30: 328–331

