Abstract

Background: Publicly accessible EST libraries contain valuable information that can be utilized for studies of tissue-specific gene expression and processing of individual genes. This information is, however, confounded by multiple systematic effects arising from the procedures used to generate these libraries.

Results: We used alignment of ESTs against a reference set of transcripts to estimate the size distributions of the cDNA inserts and sampled mRNA transcripts in individual EST libraries, and we show how these measurements can be used to inform quantitative comparisons of libraries. While significant attention has been paid to the effects of normalization and subtraction, we also find significant biases in transcript sampling introduced by the combined procedures of reverse transcription and selection of cDNA clones for sequencing. Using examples drawn from studies of mRNA 3'-processing (cleavage and polyadenylation), we demonstrate effects of the transcript sampling bias and provide a method for identifying libraries that can be safely compared without bias. All data sets, supplemental data, and software are available at our supplemental web site [1].

Conclusion: The biases we characterize in the transcript sampling of EST libraries represent a significant and heretofore under-appreciated source of false positive candidates for tissue-, cell type-, or developmental stage-specific activity or processing of genes. Uncorrected, quantitative comparison of dissimilar EST libraries will likely result in the identification of statistically significant but biologically meaningless changes.

Highlights

  • Publicly accessible EST libraries contain valuable information that can be utilized for studies of tissue-specific gene expression and processing of individual genes

  • Even non-normalized libraries are subject to systematic biases that can distort quantitative studies

  • We aligned all ESTs from each library to a reference transcript set, and used the results to estimate the length distributions of the cDNA inserts and sampled transcripts

Summary

Introduction

Publicly accessible EST libraries contain valuable information that can be utilized for studies of tissue-specific gene expression and processing of individual genes. This information is, however, confounded by multiple systematic effects arising from the procedures used to generate these libraries. While EST-based gene discovery can be quite successful, the wide dynamic range of mRNA abundance and the cost of EST creation led to the development of procedures such as normalization and subtraction [13,22], which increase the likelihood of sampling rare or tissue-specific transcripts, at the cost of lost quantitative relationships between different transcripts in a library.

Figure: The estimated length distributions of cDNA inserts (red bars) and sampled transcripts (black bars) in a mouse EST library generated from round spermatids (McCarrey, J., Eddy, M. et al., unpublished data). The length distributions of the ENSEMBL [44] and PACdb [21] reference transcripts are plotted in blue and green, respectively.
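As a rough illustration of how length distributions of cDNA inserts and sampled transcripts might be tabulated from EST-to-transcript alignments, consider the sketch below. It is not the authors' software, and the details are assumptions made for illustration: the `Alignment` record, the rule that an insert is measured only when a clone has at least two reads on a single reference transcript, and the 500-nucleotide bin width are all hypothetical choices.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple


class Alignment(NamedTuple):
    """One EST aligned to one reference transcript (hypothetical record)."""
    clone_id: str           # cDNA clone the EST read was sequenced from
    transcript_id: str      # reference transcript the read aligned to
    transcript_length: int  # length of that reference transcript
    start: int              # alignment start on the transcript (0-based)
    end: int                # alignment end on the transcript (exclusive)


def _unambiguous_clones(alignments: Iterable[Alignment]):
    """Group reads by clone; keep clones whose reads hit a single transcript."""
    by_clone = defaultdict(list)
    for aln in alignments:
        by_clone[aln.clone_id].append(aln)
    return {
        clone: hits
        for clone, hits in by_clone.items()
        if len({h.transcript_id for h in hits}) == 1
    }


def insert_lengths(alignments: Iterable[Alignment]) -> Dict[str, int]:
    """Assumed estimate: span covered by a clone's reads (needs >= 2 reads)."""
    return {
        clone: max(h.end for h in hits) - min(h.start for h in hits)
        for clone, hits in _unambiguous_clones(alignments).items()
        if len(hits) >= 2
    }


def sampled_transcript_lengths(alignments: Iterable[Alignment]) -> Dict[str, int]:
    """Length of the reference transcript each unambiguous clone sampled."""
    return {
        clone: hits[0].transcript_length
        for clone, hits in _unambiguous_clones(alignments).items()
    }


def histogram(lengths: Iterable[int], bin_width: int = 500) -> Counter:
    """Count lengths per bin, keyed by the lower edge of each bin."""
    return Counter((length // bin_width) * bin_width for length in lengths)
```

Comparing the two histograms for a library against the histogram of the full reference set gives a first impression of whether that library over- or under-samples long transcripts, which is the kind of check the text describes before libraries are compared quantitatively.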

