Abstract

Informative gene selection can have important implications for improving cancer diagnosis and identifying new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g. TSP and TSG, cannot discover pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on information-theoretic approaches, such as EMBP and FeatKNN. However, the methods employed to estimate mutual information, e.g. binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel measure, the maximal information coefficient (MIC), which captures a wide range of associations between two variables and has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y), where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulated and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality. It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG, and these synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and the OUgene database.
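
To make the idea of pair-wise synergy concrete, the following is a minimal sketch under stated assumptions, not the approximation algorithm described in the paper. It replaces MIC with a much simpler score: the largest plug-in mutual information between an equal-frequency grid and the class label, divided by log2 of the number of classes. The function names (`equal_frequency_bins`, `mutual_information`, `grid_score`), the grid sizes and the XOR-like toy phenotype are all assumptions made for illustration.

```python
import itertools
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(a, b):
    """Plug-in mutual information, in bits, between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        c / n * math.log2(c * n / (count_a[u] * count_b[v]))
        for (u, v), c in count_ab.items()
    )


def grid_score(columns, y, max_bins=4):
    """Best I(grid cell; Y) over small equal-frequency grids, in [0, 1].

    Assumption: a simplified stand-in for MIC that only tries
    equal-frequency grids and scales by log2(number of classes).
    """
    best = 0.0
    for shape in itertools.product(range(2, max_bins + 1), repeat=len(columns)):
        binned = [equal_frequency_bins(col, b) for col, b in zip(columns, shape)]
        cells = list(zip(*binned))  # one grid cell per sample
        best = max(best, mutual_information(cells, y))
    return best / math.log2(len(set(y)))


rng = random.Random(0)
x1 = [rng.uniform(-1, 1) for _ in range(1000)]
x2 = [rng.uniform(-1, 1) for _ in range(1000)]
# Hypothetical XOR-like phenotype: the class depends only on the sign of x1 * x2,
# so each variable on its own is independent of the class.
y = [1 if a * b > 0 else 0 for a, b in zip(x1, x2)]

score_x1 = grid_score([x1], y)
score_x2 = grid_score([x2], y)
score_pair = grid_score([x1, x2], y)
print(f"x1 alone: {score_x1:.3f}, x2 alone: {score_x2:.3f}, pair: {score_pair:.3f}")
```

In this toy setting the two single-variable scores stay near zero while the joint score is large, which is the kind of pattern an individual-gene ranking would miss.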

Highlights

  • Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets

  • Pair-wise gene evaluation has been implemented in several popular algorithms, including top scoring pair (TSP)[8,9], top scoring genes (TSG)[2], and doublets[7], which all compare expression values of the same sample between two different genes

  • Let X and Y be two independent random variables, with Y binarized at its median; then the maximal information coefficient (MIC) is MIC(X; Y) = 0.1702 ± 0.0292
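
The last highlight describes a baseline: even for independent variables, a score that is maximized over grids comes out above zero on finite samples. The sketch below illustrates that effect with a simulated null distribution. It is only an illustration under assumptions: it uses a simplified equal-frequency grid score rather than MIC, so it is not expected to reproduce the value 0.1702 ± 0.0292, and the sample size, number of trials and helper names (`binarize_at_median`, `max_grid_information`) are hypothetical choices.

```python
import math
import random
import statistics
from collections import Counter


def binarize_at_median(values):
    """Label each value 1 if it lies above the sample median, else 0."""
    median = statistics.median(values)
    return [1 if v > median else 0 for v in values]


def max_grid_information(x, labels, max_bins=8):
    """Largest plug-in I(bin(X); label) in bits over 2..max_bins equal-frequency bins.

    Assumption: a simplified stand-in for MIC. With a binary label the
    normalizer log2(min(bins, 2)) equals 1, so the value already lies in [0, 1].
    """
    n = len(x)
    order = sorted(range(n), key=x.__getitem__)
    label_counts = Counter(labels)
    best = 0.0
    for n_bins in range(2, max_bins + 1):
        joint = Counter()
        for rank, i in enumerate(order):
            joint[(rank * n_bins // n, labels[i])] += 1
        bin_counts = Counter()
        for (b, _), c in joint.items():
            bin_counts[b] += c
        info = sum(
            c / n * math.log2(c * n / (bin_counts[b] * label_counts[lab]))
            for (b, lab), c in joint.items()
        )
        best = max(best, info)
    return best


rng = random.Random(1)
null_scores = []
for _ in range(200):  # hypothetical number of null trials
    x = [rng.gauss(0, 1) for _ in range(100)]
    y = [rng.gauss(0, 1) for _ in range(100)]  # drawn independently of x
    null_scores.append(max_grid_information(x, binarize_at_median(y)))

mean_score = statistics.mean(null_scores)
sd_score = statistics.stdev(null_scores)
print(f"null score: {mean_score:.4f} ± {sd_score:.4f}")
```

A null distribution of this kind gives a reference level against which scores on real gene pairs can be compared.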


Summary

Results

Generality of MIC(X1; X2; Y) according to simulation analysis. If X1 and X2 are statistically independent of Y, MIC(X1; X2; Y) should be close to 0. Each reference method ranks the top 200 genes (Top200s) for each dataset (the Top200s are shown in Supplementary Material Tables S1-S3). We can observe significant overlaps between the Top200s selected by the four reference methods, as shown in Figs 6, 7 and 8. This indicates that a considerable number of similar informative genes can be detected by these reference methods. Although MRMR, SVM-RFE and TSG are not individual-gene-filter methods, the Top200s selected by them have considerable similarities to the Top200s selected by MIC(X; Y). This indicates that these methods can efficiently discover genes that are individually discriminant, but are not specific to genes that have pair-wise synergy effects.
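
The overlap comparison between top-ranked gene lists can be sketched as follows. This is an illustrative example only: the gene identifiers are made-up placeholders, the lists are cut to three genes instead of 200, and `top_k_overlap` is a hypothetical helper, not code from the paper.

```python
from itertools import combinations


def top_k_overlap(rankings, k):
    """For every pair of methods, return (shared gene count, Jaccard index) of their top-k sets."""
    tops = {method: set(ranked[:k]) for method, ranked in rankings.items()}
    result = {}
    for a, b in combinations(sorted(tops), 2):
        shared = tops[a] & tops[b]
        result[(a, b)] = (len(shared), len(shared) / len(tops[a] | tops[b]))
    return result


# Placeholder rankings; real input would be each method's ranked gene list.
rankings = {
    "MIC(X; Y)": ["g1", "g2", "g3", "g4", "g5"],
    "MRMR": ["g2", "g1", "g6", "g3", "g7"],
    "SVM-RFE": ["g1", "g3", "g2", "g8", "g9"],
    "TSG": ["g5", "g1", "g2", "g4", "g6"],
}

overlaps = top_k_overlap(rankings, k=3)
for (a, b), (shared, jaccard) in overlaps.items():
    print(f"{a} vs {b}: {shared} shared genes, Jaccard {jaccard:.2f}")
```

With real rankings, `k=200` would correspond to the Top200s comparison described above.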

