Abstract

Gene set analysis is a well-established approach for analyzing high-throughput gene expression data. The choice of gene set database may affect the outcome of the analysis, so understanding the characteristics of these databases is vital to its success. Because of the sheer size of gene set databases, a comprehensive qualitative evaluation of them is impractical. In this paper, we quantitatively study several well-established gene set databases. We propose and use a quantitative measure for assessing the similarity between gene set databases. We also introduce the presence score, which quantifies the degree to which a given gene is represented in a database, and the permeability score, which quantifies the degree to which genes in a given list co-occur in the gene sets of a database. A maximum achievable coverage score is defined based on the permeability score. Using the maximum achievable coverage score, we propose a methodology to statistically determine whether a phenotype of interest is well represented in a given database. To study the effect of the choice of gene set database on the results of gene set analysis, and to show the utility of the maximum achievable coverage score, we conduct an experiment using two widely used gene set analysis methods and three expression datasets. The results suggest that the choice of gene set database may profoundly affect the outcome of the analysis. Our findings also show that the permeability score and the maximum achievable coverage score can be used to guide the selection of an appropriate gene set database for a given study.
