Semi-supervised discovery of differential genes

Shigeyuki Oba, Shin Ishii

doi:10.1186/1471-2105-7-414


Open Access



Journal: BMC bioinformatics	Publication Date: Sep 18, 2006
Citations: 25	License type: CC BY 2.0

Affiliation: Nara Institute of Science and Technology

Abstract

Background: Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions. However, in many clinical studies, for example those that aim to detect clinical marker genes, the conditions are not necessarily well controlled, and condition labels are sometimes hard to obtain because of physical, financial, and time costs. In such a situation, we can consider an unsupervised case, where no labels are available, or a semi-supervised case, where labels are available for only part of the sample set, rather than the well-studied supervised case, where every sample has a label.

Results: We assume a latent variable model for the expression of active genes and apply the optimal discovery procedure (ODP) proposed by Storey (2005) to the model. Our latent variable model allows gene significance scores to be applied to unsupervised and semi-supervised cases. The ODP framework improves detectability by sharing the estimated parameters of the null and alternative models across the tests for multiple genes. A theoretical consideration leads to two different interpretations of the latent variable: either it affects the alternative model only implicitly, through the model parameters, or it is explicitly included in the alternative model. These interpretations correspond to two different implementations of ODP. By comparing the two implementations in experiments with simulated data, we found that sharing the latent variable estimation is effective for increasing the detectability of truly active genes. We also show that unsupervised and semi-supervised rating of genes, which takes into account the samples without condition labels, can improve the detection of active genes in real gene discovery problems.

Conclusion: The experimental results indicate that the ODP framework is effective for hypotheses that include latent variables, and that it is further improved by sharing the estimates of hidden variables over multiple tests.
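To make the semi-supervised setting concrete, the following is a minimal sketch, not the authors' model: it assumes a single gene whose expression follows a two-class Gaussian mixture with a shared variance, where the latent variable is the condition label of each sample. Known labels are clamped, and unknown labels (coded here as -1, a hypothetical convention) are estimated by EM. The function name `semi_supervised_em` and all parameters are illustrative assumptions.

```python
import numpy as np

def semi_supervised_em(x, labels, n_iter=50):
    """Estimate P(class 1) per sample for one gene.

    x      : (n,) expression values
    labels : (n,) 0 or 1 where the condition is known, -1 where it is not
             (the -1 coding is an assumption of this sketch)
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    known = labels >= 0
    n = x.size
    # Initialise: clamp known labels, split unknown samples at the median.
    r = np.where(known, labels, x > np.median(x)).astype(float)
    for _ in range(n_iter):
        # M-step: weighted class means, shared variance, mixing proportion.
        w1 = max(r.sum(), 1e-8)
        w0 = max((1.0 - r).sum(), 1e-8)
        mu1 = (r * x).sum() / w1
        mu0 = ((1.0 - r) * x).sum() / w0
        var = (r * (x - mu1) ** 2 + (1.0 - r) * (x - mu0) ** 2).sum() / n
        var = max(var, 1e-8)
        pi1 = np.clip(w1 / n, 1e-6, 1.0 - 1e-6)
        # E-step: posterior of class 1, computed from log densities.
        log1 = np.log(pi1) - 0.5 * (x - mu1) ** 2 / var
        log0 = np.log(1.0 - pi1) - 0.5 * (x - mu0) ** 2 / var
        post = 1.0 / (1.0 + np.exp(np.clip(log0 - log1, -700, 700)))
        # Labelled samples stay fixed; only unlabelled ones are updated.
        r = np.where(known, labels, post).astype(float)
    return r

x = np.array([0.1, -0.2, 0.0, 0.2, 3.1, 2.9, 3.0, 3.2])
labels = np.array([0, -1, -1, -1, 1, -1, -1, -1])  # two labelled samples only
posterior = semi_supervised_em(x, labels)
```

Setting every entry of `labels` to -1 gives the unsupervised case, and supplying all labels reduces the loop to a plain supervised fit.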

Highlights

Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions
We found that there are generally two theoretically natural but different ways to deal with hidden variables within the optimal discovery procedure (ODP) framework; namely, the estimated values of the hidden variables are shared among multiple tests either explicitly or implicitly through the model parameters
The ODP framework [6,7] proposed a theoretically optimal way of sharing hypothetical models among tests, by sharing likelihood functions based on maximum likelihood (ML) parameters
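The idea of sharing likelihood functions across tests can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: each gene is modelled as i.i.d. Gaussian, the null model fixes the mean at zero, the alternative uses the ML mean, and every gene contributes to both sums (the estimated ODP restricts the denominator to genes judged to be null, which is omitted here). The name `odp_like_scores` is hypothetical.

```python
import numpy as np

def _logsumexp(a):
    # Numerically stable log(sum(exp(a))).
    top = a.max()
    return top + np.log(np.exp(a - top).sum())

def _loglik(x, mean, var):
    # Log-likelihood of one gene's data x (n,) under each of m fitted
    # Gaussian models with parameters mean (m,) and var (m,).
    resid = x[None, :] - mean[:, None]
    return -0.5 * x.size * np.log(2 * np.pi * var) - 0.5 * (resid ** 2).sum(axis=1) / var

def odp_like_scores(X, eps=1e-8):
    """X: (genes, samples). Returns one log-scale score per gene."""
    X = np.asarray(X, dtype=float)
    mu_alt = X.mean(axis=1)                  # ML mean under the alternative
    var_alt = X.var(axis=1) + eps            # ML variance under the alternative
    var_null = (X ** 2).mean(axis=1) + eps   # ML variance under the null (mean 0)
    zeros = np.zeros_like(mu_alt)
    scores = np.empty(X.shape[0])
    for j, x in enumerate(X):
        # Evaluate gene j's data under every gene's fitted models, then
        # compare the summed alternative and summed null likelihoods.
        log_alt = _logsumexp(_loglik(x, mu_alt, var_alt))
        log_null = _logsumexp(_loglik(x, zeros, var_null))
        scores[j] = log_alt - log_null
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X[:5] += 3.0                                 # five simulated active genes
scores = odp_like_scores(X)
```

Because each gene's data are scored against the models fitted to all genes, a gene whose pattern resembles that of other active genes is rated higher than it would be by a test that uses only its own fit.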

Summary

Introduction

Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions. In many clinical studies, for example those that aim to detect clinical marker genes, the conditions are not necessarily well controlled, and condition labels are sometimes hard to obtain because of physical, financial, and time costs. In such a situation, we can consider an unsupervised case, where no labels are available, or a semi-supervised case, where labels are available for only part of the sample set, rather than the well-studied supervised case, where every sample has a label. The two most important ideas are (a) considering family-wise errors, and (b) sharing commonality among multiple tests. These ideas are not limited to microarray studies, but can be applied to current and future bioinformatics subjects. It is becoming clearer that, in microarray studies, we cannot discuss the significance of each gene of interest individually; rather, we should handle a set of hypotheses by controlling the family-wise errors.
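As one concrete instance of controlling family-wise errors over a set of hypotheses, the sketch below uses Holm's step-down procedure. The text does not name a specific procedure, so this choice, along with the function name `holm_reject` and the example p-values, is an assumption made purely for illustration.

```python
import numpy as np

def holm_reject(pvalues, alpha=0.05):
    """Return a boolean array marking hypotheses rejected by Holm's procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)            # indices from smallest to largest p-value
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with
        # alpha / (m - k + 1); stop at the first failure.
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

pvals = [0.001, 0.04, 0.03, 0.5]     # illustrative per-gene p-values
rejected = holm_reject(pvals)
```

Here only the first hypothesis is rejected: 0.001 passes the threshold 0.05 / 4, whereas the next smallest p-value, 0.03, exceeds 0.05 / 3, so the procedure stops.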

Objectives

Results

Discussion

Conclusion