Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters

Nathalie Pavy, Jerry Liu, Asim Siddiqui, Jeff Stott, Charles Paule, John Mackay, George Yang, Sarah Barber, Jean Bousquet, Carine Guillet-Claude, Etienne Noumen, James E Johnson, Marie Josee Morency, Robert A Holt, Yaron S N Butterfield, Armand Séguin, John A Crow, Robert B Kirkpatrick, Janice E K Cooke, Lee S Parsons, Ernest F Retzel, Marco A Marra

doi:10.1186/1471-2164-6-144

Abstract

Background: The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss).

Results: We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched rice or Arabidopsis genomes. We used several sequence similarity search approaches for assignment of putative functions, including BLAST searches against general and specialized databases (transcription factors, cell wall related proteins), Gene Ontology term assignment and Hidden Markov Model searches against Pfam protein families and domains. In total, 70% of the spruce transcripts displayed matches to proteins of known or unknown function in the UniRef100 database (blastx e-value < 1e-10). We identified multigenic families that appeared larger in spruce than in the Arabidopsis or rice genomes. Detailed analysis of translationally controlled tumour proteins and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data either based on their occurrence in the cDNA libraries or on functional annotations.

Conclusion: This report illustrates specific approaches for large-scale gene discovery and annotation in an organism that is very distantly related to any of the fully sequenced genomes. The ArboreaSet sequences and cDNA clones represent a valuable resource for investigations ranging from plant comparative genomics to applied conifer genetics.
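The abstract states that consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define transcripts, but it does not say how the linking was implemented. One plausible way to express that step is a union-find over consensus sequences, merging any two that share a clone. The sketch below is illustrative only: the function name `link_contigs`, the two input mappings, and all identifiers in the example are hypothetical and do not come from the paper or from SpruceDB.

```python
from collections import defaultdict


def link_contigs(read_to_contig, read_to_clone):
    """Group consensus sequences (contigs) that share a cDNA clone.

    read_to_contig: dict mapping read ID -> contig ID (hypothetical input)
    read_to_clone:  dict mapping read ID -> cDNA clone ID (hypothetical input)
    Returns a list of sets of contig IDs, one set per putative transcript.
    """
    # Every contig starts as its own group.
    parent = {contig: contig for contig in read_to_contig.values()}

    def find(contig):
        # Walk up to the group root, shortening the path along the way.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the contigs reached by the reads of each clone.
    clone_contigs = defaultdict(set)
    for read, contig in read_to_contig.items():
        clone_contigs[read_to_clone[read]].add(contig)

    # Contigs reached from the same clone are merged into one group.
    for contigs in clone_contigs.values():
        first, *rest = sorted(contigs)
        for other in rest:
            union(first, other)

    groups = defaultdict(set)
    for contig in parent:
        groups[find(contig)].add(contig)
    return sorted(groups.values(), key=sorted)


if __name__ == "__main__":
    # Made-up example: clone A spans C1 and C2, clone B spans C2 and C3.
    read_to_contig = {"A_5p": "C1", "A_3p": "C2", "B_5p": "C2",
                      "B_3p": "C3", "D_5p": "C4"}
    read_to_clone = {"A_5p": "A", "A_3p": "A", "B_5p": "B",
                     "B_3p": "B", "D_5p": "D"}
    for transcript in link_contigs(read_to_contig, read_to_clone):
        print(sorted(transcript))
```

Under this reading, linking is transitive: if one clone connects two consensus sequences and a second clone connects one of them to a third, all three fall into a single transcript. That is consistent with the number of transcripts (14,471) being lower than the number of consensus sequences (16,578), though the actual procedure may have differed.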

Highlights

The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future.
The cDNA libraries were developed with the goal of augmenting the representation of conifer transcripts available in public databases, and to support experimental goals related to vascular development.
Among the quality reads, 33.5% were from secondary vascular tissues, 32.2% were from roots, 16.7% from young shoots, and the remaining 17.6% were from various organs including male strobili, female cones, buds, somatic embryos, and needles (Table 1).

Summary

Introduction

The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future. Genomics projects have been initiated in several pine and spruce species to identify genes involved in traits of economic interest and of ecological significance in conifers. It is unlikely that conifer genomes will be completely sequenced in the near future because of their sheer size [1]. It is unknown how widespread this phenomenon may be; the finding suggests that although conserved protein motifs may be unambiguously identified, the biological role of genes belonging to conifer protein families may not be readily inferred from their angiosperm homologs. These data would support the argument in favour of thorough cDNA sequencing projects in conifers, which are distantly related to model angiosperms like Arabidopsis, in order to fully characterize protein families.

Journal: BMC Genomics
Publication Date: Oct 19, 2005
Citations: 143
License type: CC BY 2.0
