Sequencing of a Transcriptome

The rapid completion and public release of the genome sequences of mouse and human have led to a downward revision of the number of “genes” predicted in the mammalian genome to the region of 30,000 (Mouse Genome Sequencing Consortium, Waterston et al. 2002). In simpler organisms such as yeast, estimating gene number is comparatively straightforward, because the majority of the genome clearly encodes proteins, and individual genes generally have a well-defined start and finish and a single mRNA output. In mammals, the task is much more complex. Only a small proportion of the genome encodes mRNAs that in turn encode protein, and protein-coding sequence is interspersed with large introns or intergenic regions. Even protein-coding genes have proven difficult to annotate reliably (Kawai et al. 2001), and non-protein-coding genes are essentially impossible to annotate a priori. The key to reliable annotation of a mammalian genome is the comprehensive characterization of the transcriptional output, the transcriptome.

There are two approaches to this problem. The most common is high-throughput sequencing of cDNA ends (ESTs). In mouse and human, and to a lesser extent in many other mammals, there are millions of EST sequences in various repositories. EST sequences can be computationally assembled into clusters, as in the UniGene projects (http://www.ncbi.nlm.nih.gov/UniGene). This approach has many drawbacks, from the perspectives of cDNA cloning, sequence quality, and computation, but the most compelling is that the assembled sequences are generated in silico and are not necessarily supported by a physical clone. It is also rather inefficient, because even with the best subtraction and normalization, abundant transcripts have been sequenced thousands of times, whereas many rare transcripts are absent from EST databases.
EST assemblies are particularly difficult to interpret when there are multigene families or complex alternative splicing.

The alternative approach is to systematically isolate and sequence full-length cDNAs. The logistics of this approach are daunting, and it is actually far more challenging than genomic sequencing, especially by shotgun approaches, because of the difficulties in collecting the samples. Nevertheless, the RIKEN Mouse Gene Encyclopedia Project has taken this approach. In the process, the RIKEN team has provided a model for eukaryotic transcriptome projects. The task required a range of new technologies and approaches. In outline, the RIKEN team developed new approaches to the production of full-length cDNAs (Carninci et al. 2003) that required (1) a novel reverse transcriptase reaction (to enable effective complete first-strand synthesis), (2) novel 5′ end capture technology, and (3) novel approaches to normalization and subtraction of cDNA libraries. Starting with their first libraries, the RIKEN team sequenced 3′ ends (and later 5′ ends) in a Phase 1 sequencing pipeline and, for each individual clone, determined whether it had been sequenced previously or could be ascribed to a new cluster. In the second phase, individual representatives of EST clusters were selected and fully sequenced to produce a full-length cDNA sequence representing the sequence of an individual physical clone. At a number of stages in the project, the RIKEN team assembled a set of cDNAs that had previously been sequenced and used them to subtract successive libraries. The success of the approach is outlined in detail in Carninci et al. (2003). The output of this pipeline was analyzed in the FANTOM2 meeting (April 29 to May 5, 2002, Yokohama, Japan), which is the basis of this special issue of Genome Research.
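
The Phase 1 decision described above, in which each newly sequenced clone end is either ascribed to an existing cluster or used to found a new one, can be illustrated with a toy sketch. The code below is not the RIKEN pipeline, whose actual clustering method is not specified here. It is a minimal greedy example that assumes sequence similarity can be approximated by the fraction of shared k-mers; the function names, the k-mer length of 8, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 8) -> Set[str]:
    """Return the set of all length-k substrings of an end sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: Set[str], b: Set[str]) -> float:
    """Fraction of the smaller k-mer set that is shared with the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def assign_clones(reads: Dict[str, str], k: int = 8,
                  threshold: float = 0.5) -> Dict[str, int]:
    """Greedily assign each clone's end sequence to a cluster index.

    A clone joins the best-matching existing cluster if its similarity to
    that cluster's founding sequence reaches the threshold (hypothetical
    value); otherwise it founds a new cluster.
    """
    representatives: List[Set[str]] = []  # k-mers of each cluster's founder
    assignment: Dict[str, int] = {}
    for clone_id, seq in reads.items():
        tags = kmers(seq, k)
        best_idx, best_score = -1, 0.0
        for idx, rep in enumerate(representatives):
            score = similarity(tags, rep)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx >= 0 and best_score >= threshold:
            assignment[clone_id] = best_idx      # already represented
        else:
            representatives.append(tags)         # candidate new transcript
            assignment[clone_id] = len(representatives) - 1
    return assignment
```

In this sketch, a clone whose end sequence is a truncated copy of an earlier read falls into that read's cluster, whereas an unrelated sequence opens a new cluster and would be a candidate for full-length sequencing in the second phase. A production pipeline would use proper sequence alignment and quality filtering rather than raw k-mer overlap.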