Abstract

Background: A central question in bioinformatics is how to minimize arbitrariness and bias in analysis of patterns of enrichment in data. A prime example of such a question is enrichment of gene ontology (GO) classes in lists of genes. Our paper deals with two issues within this larger question. The first is how to calculate the false discovery rate (FDR) within a set of apparently enriched ontologies, and the second is how to set that FDR within the context of assessing significance for addressing biological questions. To answer these questions, we compare a random resampling method with a commonly used method for assessing FDR, the Benjamini-Hochberg (BH) method. We further develop a heuristic method for evaluating Type II (false negative) errors to enable the use of F-measure binary classification theory for distinguishing “significant” from “non-significant” degrees of enrichment. Results: The results show that random resampling assessment of FDR is feasible and preferable to the analytical methods with which we compare it. They also show that the reasonableness of any arbitrary threshold depends strongly on the structure of the dataset being tested, suggesting that the less arbitrary method of F-measure optimization to determine the significance threshold is preferable. Conclusion: We therefore suggest using F-measure optimization instead of an arbitrary threshold to evaluate the significance of Gene Ontology enrichment results, and using resampling to replace analytical methods.
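For readers unfamiliar with the analytical baseline named above, the Benjamini-Hochberg adjustment can be sketched in a few lines. This is a minimal illustration of the standard step-up procedure, not the authors' implementation; the function name `benjamini_hochberg` and the example p-values are assumptions made up for this sketch.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five GO terms (illustrative only).
p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = benjamini_hochberg(p)
# Terms whose adjusted p-value falls at or below a 5% FDR cutoff.
significant = [i for i, value in enumerate(q) if value <= 0.05]
print(q, significant)
```

The fixed 5% cutoff in the last step is exactly the kind of arbitrary threshold that the abstract proposes replacing with F-measure optimization.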

Highlights

  • A central question in bioinformatics is how to minimize arbitrariness and bias in analysis of patterns of enrichment in data

  • For the four data sets examined, we found that the optimal F1 occurs at the position where the MCC reaches its maximum

  • We considered the performance of the random resampling method on datasets created by corrupting the yeast ESR set by the addition of randomly selected genes, up to a level at which the dataset was 5/6 random genes and only 1/6 the original canonical dataset
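The relationship between F1 and the Matthews correlation coefficient (MCC) mentioned in the highlights can be made concrete with a small sketch. The confusion-matrix counts below are invented purely for illustration and are not taken from the paper; the threshold values and the names `f1_score`, `mcc`, and `counts` are likewise assumptions. In the paper the false negative counts come from a heuristic that is not reproduced here.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical (TP, FP, FN, TN) counts at three candidate p-value thresholds.
counts = {
    0.01: (20, 1, 30, 949),
    0.05: (35, 5, 15, 945),
    0.20: (45, 40, 5, 910),
}

# Pick the threshold that maximizes each measure.
best_f1 = max(counts, key=lambda t: f1_score(*counts[t][:3]))
best_mcc = max(counts, key=lambda t: mcc(*counts[t]))
print(best_f1, best_mcc)
```

With these made-up counts both measures peak at the same threshold, which mirrors the pattern the highlights report for the four data sets examined; the sketch does not show that this holds in general.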

Summary

Introduction

A central question in bioinformatics is how to minimize arbitrariness and bias in analysis of patterns of enrichment in data. The gene lists from experiments such as microarrays are gathered into clusters associated with biological attributes, which are defined as GO terms [1]. GO aspires to create a formal naming system to define the biologically significant attributes of genes across all organisms. Each enriched GO term derived from a list of genes is evaluated by its significance level, i.e., the probability that the measured enrichment would be matched or exceeded by pure chance. Enrichment tools have been developed to process large gene lists with the goal of inferring significantly enriched ontologies. GO-Miner [7], Babelomics [8], FatiGO [9], GSEA [10, 11], and ErmineJ [12] apply resampling or permutation algorithms on random sets to evaluate the number of false positives in computed gene ontologies associated with test sets. DAVID [4] and Babelomics [8] introduced level-specific enrichment analysis; that is, not including both parent and child terms. BiNGO [16], a Cytoscape plugin, is associated with output tree graphs.
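The significance level described above, the probability that the measured enrichment would be matched or exceeded by pure chance, is often modeled as a hypergeometric upper tail. The sketch below assumes that model; the text here does not state which test the authors or the listed tools use, and the function name `enrichment_p_value` and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities from `hits` up to the maximum
    # possible overlap; comb() returns 0 for impossible combinations.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy example: 10 genes in total, 4 annotated, 3 drawn, at least 2 annotated.
p = enrichment_p_value(population=10, annotated=4, sample=3, hits=2)
print(p)
```

A resampling approach would instead estimate this kind of probability empirically, by drawing many random gene lists of the same size and counting how often the observed enrichment is matched or exceeded.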
