Abstract
Background: Large scale gene analysis of most organisms is hampered by incomplete genomic sequences. In many organisms, such as soybean, the best source of sequence information is expressed sequence tag (EST) libraries. Soybean has a large (1115 Mbp) genome that has yet to be fully sequenced. However, it does have the 6th largest EST collection, composed of ESTs from a variety of soybean genotypes. Many EST libraries were constructed from RNA extracted from various genetic backgrounds; thus, gene identification from these sources is complicated by the existence of both gene and allele sequence differences. We used the ESTminer suite of programs to identify potential soybean gene transcripts from a single genetic background, allowing us to observe functional classifications between gene families as well as structural differences between genes and gene paralogs within families. The identification of potential gene sequences (pHaps) from soybean allows us to begin to get a picture of the genomic history of the organism and to observe the evolutionary fates of gene copies in this highly duplicated genome.
Results: We identified approximately 45,000 potential gene sequences (pHaps) from EST sequences of Williams/Williams82, an inbred genotype of soybean (Glycine max L. Merr.), using a redundancy criterion to identify reproducible sequence differences between related genes within gene families. Analysis of these sequences revealed that single base substitutions and single base indels are the most frequently observed forms of sequence variation between genes within families in the dataset. Genomic sequencing of selected loci indicates that intron-like intervening sequences are numerous and are approximately 220 bp in length. Functional annotation of gene sequences indicates that functional classifications are not randomly distributed among gene families containing few or many genes.
Conclusion: The predominance of single nucleotide insertions/deletions and substitution events between genes within families (individual genes and gene paralogs) is consistent with a model of gene amplification followed by single base random mutational events, as expected under the classical model of duplicated gene evolution. Molecular functions of small and large gene families appear to be non-randomly distributed, possibly indicating a difference in retention of duplicates or local expansion.
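The idea of a redundancy criterion can be illustrated with a minimal sketch. It is not the ESTminer implementation: the function name `reproducible_variant_columns`, the `min_support` threshold of 2, and the toy reads are all hypothetical. The sketch assumes the ESTs are already aligned to equal length, and it treats a column as a reproducible difference only when at least two distinct states (a base, or a gap for an indel) are each seen in at least `min_support` reads, so that a difference seen in a single read is treated as a possible sequencing error.

```python
from collections import Counter


def reproducible_variant_columns(aligned_reads, min_support=2):
    """Return alignment columns with at least two states that are each
    supported by min_support or more reads (hypothetical criterion)."""
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    variant_columns = []
    for col in range(length):
        # Count every state in this column; '-' (gap) is counted so that
        # single base indels are detected, 'N' (unknown base) is ignored.
        counts = Counter(read[col] for read in aligned_reads if read[col] != "N")
        supported = [state for state, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            variant_columns.append(col)
    return variant_columns


# Toy alignment: column 2 differs reproducibly (G in three reads, C in two);
# the T at column 5 occurs in one read only and is not counted.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",
    "ACCTAC",
    "ACGTAT",
]
print(reproducible_variant_columns(reads))
```

Raising `min_support` makes the criterion stricter, at the cost of missing genuine differences between genes that are represented by few ESTs.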
Highlights
Large scale gene analysis of most organisms is hampered by incomplete genomic sequences
The Cap3 parameters were adjusted with the goal of including all members of a gene family in the Cap3 consensus sequence, even though this allowed the inclusion of some expressed sequence tags (ESTs) that were only distantly related or whose shared similarity was based on only a relatively short motif
Clustering of the 196,867 All-Williams (AW) ESTs using Cap3 resulted in 17,463 Cap3 consensus sequences and 56,430 Cap3 singletons. Preliminary analysis of these results revealed that Cap3 did not consistently include all of the related ESTs in an alignment
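The clustering counts above imply a few simple summary figures. The short calculation below derives them, under the assumption that every AW EST ends up either in exactly one Cap3 consensus sequence or as a singleton; the variable names are illustrative only.

```python
# Counts reported for the All-Williams (AW) Cap3 clustering.
total_ests = 196_867
consensus_sequences = 17_463
singletons = 56_430

# Assumption: each EST is either a singleton or a member of one consensus.
ests_in_consensus = total_ests - singletons
mean_ests_per_consensus = ests_in_consensus / consensus_sequences
singleton_fraction = singletons / total_ests

print(f"ESTs in consensus sequences: {ests_in_consensus}")
print(f"Mean ESTs per consensus sequence: {mean_ests_per_consensus:.2f}")
print(f"Fraction of ESTs left as singletons: {singleton_fraction:.1%}")
```

Under that assumption, 140,437 ESTs fall into consensus sequences (about 8 ESTs per consensus on average), and roughly 29% of the ESTs remain singletons.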
Summary
Large scale gene analysis of most organisms is hampered by incomplete genomic sequences. The majority of efforts at gene discovery for many organisms, including soybean, has been through the sampling and partial sequencing of gene transcripts (expressed sequence tags or ESTs) [2]. Such EST data form a valuable foundation for the understanding of the gene composition and genomic biology of yet-to-be fully sequenced genomes [3]. The more recent duplication event in particular would be expected to result in many paralogous pairs of genes differing by relatively few sequence differences, complicating gene identification using ESTs. Although some preliminary studies have examined the level of sequence variation between selected genes and their alleles in soybean [7], no systematic analysis of this important subject has been done until now.