Abstract

BackgroundSerial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome. These profiles are count-based and are assumed to follow a Binomial or Poisson distribution. However, tag counts observed across multiple libraries (for example, one or more groups of biological replicates) have additional variance that cannot be accommodated by this assumption alone. Several models have been proposed to account for this effect, all of which utilize a continuous prior distribution to explain the excess variance. Here, a Poisson mixture model, which assumes excess variability arises from sampling a mixture of distinct components, is proposed and the merits of this model are discussed and evaluated.ResultsThe goodness of fit of the Poisson mixture model on 15 sets of biological SAGE replicates is compared to the previously proposed hierarchical gamma-Poisson (negative binomial) model, and a substantial improvement is seen. In further support of the mixture model, there is observed: 1) an increase in the number of mixture components needed to fit the expression of tags representing more than one transcript; and 2) a tendency for components to cluster libraries into the same groups. A confidence score is presented that can identify tags that are differentially expressed between groups of SAGE libraries. Several examples where this test outperforms those previously proposed are highlighted.ConclusionThe Poisson mixture model performs well as a) a method to represent SAGE data from biological replicates, and b) a basis to assign significance when testing for differential expression between multiple groups of replicates. Code for the R statistical software package is included to assist investigators in applying this model to their own data.

Highlights

  • Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome

  • Such tests require tag counts be normalized by division with the total library size; this removal of library size from the set of sufficient statistics discards an informative facet of the data

  • Goodness of fit In order to evaluate the efficacy of a mixture model approach, a comparison of the goodness of fit of this and previously described models on 15 sets of biological replicates from publicly available SAGE data was performed

Read more

Summary

Introduction

Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome These profiles are count-based and are assumed to follow a Binomial or Poisson distribution. When testing for differential expression between groups, where each group can contain multiple libraries, statistical methods that incorporate a continuous probability distribution (e.g. the Normal distribution assumed by Student's t-test) should be avoided. Such tests require tag counts be normalized by division with the total library size; this removal of library size from the set of sufficient statistics discards an informative facet of the data

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call