Abstract

Background: Current methods to find significantly under- and over-represented gene ontology (GO) terms in a set of genes consider the genes as equally probable "balls in a bag", as may be appropriate for transcripts in micro-array data. However, because genes and intergenic regions vary in length, that approach is inappropriate for deciding whether any GO terms are correlated with a set of genomic positions.

Results: We present an algorithm, GONOME, that can determine which GO terms are significantly associated with a set of genomic positions, given a genome annotated with (at least) the starts and ends of genes. We show that certain GO terms may appear to be significantly associated with a set of randomly chosen positions in the human genome if gene lengths are not considered, and that these same terms have been reported as significantly over-represented in a number of recent papers. This apparent over-representation disappears when gene lengths are considered, as GONOME does. For example, we show that, when gene length is taken into account, the term "development" is not significantly enriched in genes associated with human CpG islands, in contradiction to a previous report. We further demonstrate the efficacy of GONOME by showing that occurrences of the proteasome-associated control element (PACE) upstream activating sequence in the S. cerevisiae genome are significantly associated with appropriate GO terms. An extension of this approach yields a whole-genome motif discovery algorithm that allows identification of many other promoter sequences linked to different types of genes, including a large group of previously unknown motifs significantly associated with the terms 'translation' and 'translational elongation'.

Conclusion: GONOME is an algorithm that correctly extracts over-represented GO terms from a set of genomic positions. By explicitly considering gene size, GONOME avoids a systematic bias toward GO terms linked to large genes. Inappropriate use of existing algorithms that do not take gene size into account has led to erroneous or suspect conclusions. Conversely, GONOME may be used to identify new features in genomes that are significantly associated with particular categories of genes.
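
The bias toward GO terms linked to large genes can be illustrated with a small calculation. The sketch below compares the chance that a hit carries a given term under the "balls in a bag" model (every gene equally likely) with the chance under a model in which every genic position is equally likely. The gene names, lengths and term assignments are invented toy values, the function names are hypothetical, and intergenic regions are ignored for simplicity; this is not GONOME's implementation.

```python
# Toy annotation: (gene name, length in bp, GO terms). All values are invented.
GENES = [
    ("geneA", 200_000, {"development"}),
    ("geneB", 2_000, {"translation"}),
    ("geneC", 3_000, {"translation"}),
    ("geneD", 5_000, {"metabolism"}),
]


def term_prob_equal_genes(genes, term):
    """Probability of drawing a gene with `term` when every gene is equally likely."""
    return sum(1 for _, _, terms in genes if term in terms) / len(genes)


def term_prob_by_length(genes, term):
    """Probability that a uniformly chosen genic position lies in a gene with `term`."""
    total = sum(length for _, length, _ in genes)
    return sum(length for _, length, terms in genes if term in terms) / total


if __name__ == "__main__":
    for term in ("development", "translation"):
        print(
            term,
            term_prob_equal_genes(GENES, term),
            round(term_prob_by_length(GENES, term), 3),
        )
```

In this toy genome a single long gene annotated with "development" captures most random positions, so a test that treats genes as equally likely would expect far fewer "development" hits than random positions actually produce.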

Highlights

  • Current methods to find significantly under- and over-represented gene ontology (GO) terms in a set of genes consider the genes as equally probable "balls in a bag", as may be appropriate for transcripts in micro-array data

  • GONOME: Gene Ontology correlations in the genome. We have developed a new application, called GONOME [6], which calculates the statistical significance of the correlation between a set of genomic positions and their associated Gene Ontology terms

  • GONOME does this by applying a random model that assumes that each position in the portion of the genome under consideration is equally likely
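
A random model in which each position is equally likely can be approximated by simulation. The following is a minimal sketch under stated assumptions: a single linear sequence, sorted non-overlapping genes with half-open coordinates, a position counted only when it falls inside a gene, and an empirical p-value obtained by resampling. The annotation, the constants and the function names are hypothetical, and the source does not say that GONOME computes significance this way.

```python
import random
from bisect import bisect_right

# Hypothetical toy annotation: (start, end, GO terms), half-open, sorted, non-overlapping.
GENES = [
    (1_000, 51_000, {"development"}),
    (60_000, 62_000, {"translation"}),
    (70_000, 73_000, {"translation"}),
]
GENOME_LENGTH = 100_000
STARTS = [start for start, _, _ in GENES]


def terms_at(pos):
    """GO terms of the gene covering `pos`, or an empty set for intergenic positions."""
    i = bisect_right(STARTS, pos) - 1
    if i >= 0 and pos < GENES[i][1]:
        return GENES[i][2]
    return set()


def count_hits(positions, term):
    """Number of positions that fall inside a gene annotated with `term`."""
    return sum(1 for p in positions if term in terms_at(p))


def empirical_p_value(observed_positions, term, n_sim=2_000, seed=0):
    """Share of random position sets with at least as many hits on `term` as observed."""
    rng = random.Random(seed)
    observed = count_hits(observed_positions, term)
    n = len(observed_positions)
    at_least = 0
    for _ in range(n_sim):
        # Every position in the sequence is equally likely to be drawn.
        sample = [rng.randrange(GENOME_LENGTH) for _ in range(n)]
        if count_hits(sample, term) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_sim + 1)


if __name__ == "__main__":
    positions = [60_500, 61_000, 70_100, 71_500, 5_000]
    for term in ("translation", "development"):
        print(term, empirical_p_value(positions, term))
```

Because positions are drawn uniformly along the sequence, long genes absorb proportionally more random hits, so a term attached to a long gene needs correspondingly more observed hits before it looks enriched.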


Summary

Introduction

Current methods to find significantly under- and over-represented gene ontology (GO) terms in a set of genes consider the genes as equally probable "balls in a bag", as may be appropriate for transcripts in micro-array data. The Gene Ontology (GO) project [1] arose partly in response to the problem of non-uniform assignment of genomic annotations. Biological databases are notorious for the inconsistency of their annotation terminology, and attempts to apply statistical methods based on annotation [...] GO has become a popular way of analyzing sets of genes to find under- or over-represented terms associated with that set of genes, especially in expression micro-array datasets. For example, one can apply a "GO analysis" to sets of up- or down-regulated genes to assess which processes or functions are undergoing coordinated regulation. A variety of web-based tools exist that allow one to enter a list of gene identifiers and find the over- and under-represented GO terms associated with those genes, for example GOstat [2] and GO::TermFinder [3].
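
For contrast, the conventional "balls in a bag" test on a gene list is often formalized as a hypergeometric tail probability. The sketch below is a generic illustration using only the Python standard library; the function name and arguments are hypothetical, and it does not reproduce the code of GOstat or GO::TermFinder.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, genes_with_term, sample_size, sample_with_term):
    """P(X >= sample_with_term) when sample_size genes are drawn without replacement.

    Every gene is treated as equally likely to be drawn, which is the
    assumption that breaks down for sets of genomic positions.
    """
    upper = min(genes_with_term, sample_size)
    denom = comb(total_genes, sample_size)
    tail = sum(
        comb(genes_with_term, k) * comb(total_genes - genes_with_term, sample_size - k)
        for k in range(sample_with_term, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # Toy numbers: 10 genes in total, 5 carry the term, 2 sampled, both carry it.
    print(hypergeom_enrichment_p(10, 5, 2, 2))
```

Nothing in this calculation refers to gene length, which is why it suits a list of transcripts but not a list of genomic coordinates.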
