Abstract
Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g. TSP and TSG, cannot discover pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on information-theoretic approaches, such as EMBP and FeatKNN. However, the methods employed to estimate mutual information, e.g. binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel measure, the maximal information coefficient (MIC), which captures a wide range of associations between two variables and has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y) where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulation and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality. It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG. Synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and the OUgene database.
Highlights
Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets
Pair-wise gene evaluation has been implemented in several popular algorithms, including top scoring pair (TSP)[8,9], top scoring genes (TSG)[2], and doublets[7], which all compare expression values of the same sample between two different genes
When X and Y are two independent random variables and Y is binarized at its median, the maximal information coefficient MIC(X; Y) = 0.1702 ± 0.0292
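The nonzero value reported for independent variables can be illustrated with a small simulation. The sketch below is a simplified stand-in for MIC, not the estimator of Reshef et al. or of the authors. It assumes equal-frequency bins on X instead of an optimized partition, a grid-size bound of n^0.6, and hypothetical settings (100 samples, 200 repeats) that are not taken from the source, so it is not expected to reproduce 0.1702. The names `simplified_mic`, `equal_frequency_bins` and `mutual_information_bits` are introduced here for illustration only.

```python
import numpy as np

def mutual_information_bits(a_labels, b_labels):
    """Plug-in mutual information (in bits) between two discrete label arrays."""
    a_idx = np.unique(a_labels, return_inverse=True)[1].reshape(-1)
    b_idx = np.unique(b_labels, return_inverse=True)[1].reshape(-1)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)      # contingency table of counts
    joint /= joint.sum()                     # joint probabilities
    pa = joint.sum(axis=1, keepdims=True)    # marginal of a
    pb = joint.sum(axis=0, keepdims=True)    # marginal of b
    nz = joint > 0                           # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def equal_frequency_bins(x, n_bins):
    """Assign each value to one of n_bins rank-based bins of near-equal size."""
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(x)

def simplified_mic(x, y_binary, alpha=0.6):
    """Maximum normalized MI over X-grid sizes nx with nx * 2 <= n**alpha.

    Assumption: equal-frequency bins replace the optimized partition of MIC.
    """
    n = len(x)
    max_bins = max(2, int(n ** alpha) // 2)
    best = 0.0
    for nx in range(2, max_bins + 1):
        mi = mutual_information_bits(equal_frequency_bins(x, nx), y_binary)
        best = max(best, mi / np.log2(min(nx, 2)))  # Y is binary, so this divides by 1
    return best

rng = np.random.default_rng(0)
n_samples, n_repeats = 100, 200              # hypothetical settings, not from the source
null_scores = []
for _ in range(n_repeats):
    x = rng.normal(size=n_samples)
    y = rng.normal(size=n_samples)           # drawn independently of x
    y_binary = (y > np.median(y)).astype(int)
    null_scores.append(simplified_mic(x, y_binary))
null_scores = np.array(null_scores)
print(f"score under independence: {null_scores.mean():.4f} ± {null_scores.std():.4f}")
```

Because the score is a maximum over several grids of a plug-in estimate that is never negative, its average under independence stays above zero for finite samples. This is why a baseline like the one in the highlight is useful when reading MIC values.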
Summary
Generality of MIC(X1; X2; Y) according to simulation analysis. If X1 and X2 are statistically independent of Y, MIC(X1; X2; Y) should be close to 0. Each reference method ranks the top 200 genes (Top200s) for each dataset (Top200s are shown in Supplementary Material Tables S1-S3). We can observe significant overlaps between the Top200s selected by the four reference methods, as shown in Figs 6, 7 and 8. This indicates that a considerable number of similar informative genes can be detected by these reference methods. Although MRMR, SVM-RFE and TSG are not individual-gene-filter methods, the Top200s they select are considerably similar to the Top200s selected by MIC(X; Y). This indicates that these methods can efficiently discover genes that are individually discriminant, but are not specific to genes that have pair-wise synergy effects.
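The difference between individually discriminant genes and synergic pairs can be illustrated with an XOR-like toy example. The sketch below is not the authors' approximation algorithm for MIC(X1; X2; Y). It assumes a plug-in mutual information estimate on small equal-frequency grids, normalized by log2 of the number of classes of Y, and the functions `single_score` and `pair_score` are hypothetical names used only here. The simulated data (1000 samples, Y determined by the sign of X1 · X2) are likewise an assumption.

```python
import numpy as np

def mutual_information_bits(a_labels, b_labels):
    """Plug-in mutual information (in bits) between two discrete label arrays."""
    a_idx = np.unique(a_labels, return_inverse=True)[1].reshape(-1)
    b_idx = np.unique(b_labels, return_inverse=True)[1].reshape(-1)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def rank_bins(x, n_bins):
    """Assign each value to one of n_bins rank-based bins of near-equal size."""
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(x)

def single_score(x, y, bin_counts=(2, 3)):
    """Best normalized I(X; Y) over a few grid sizes (illustrative only)."""
    n_classes = len(np.unique(y))
    best = max(mutual_information_bits(rank_bins(x, b), y) for b in bin_counts)
    return best / np.log2(n_classes)

def pair_score(x1, x2, y, bin_counts=(2, 3)):
    """Best normalized I(X1, X2; Y) over a few 2-D grids (illustrative only)."""
    n_classes = len(np.unique(y))
    best = 0.0
    for b1 in bin_counts:
        for b2 in bin_counts:
            # Encode each (X1 bin, X2 bin) grid cell as a single label.
            cells = rank_bins(x1, b1) * b2 + rank_bins(x2, b2)
            best = max(best, mutual_information_bits(cells, y))
    return best / np.log2(n_classes)

rng = np.random.default_rng(0)
n_samples = 1000                               # hypothetical sample size
x1 = rng.uniform(-1, 1, n_samples)
x2 = rng.uniform(-1, 1, n_samples)
y = (x1 * x2 > 0).astype(int)                  # class depends only on the pair
y_random = rng.integers(0, 2, n_samples)       # class independent of both genes

score_x1 = single_score(x1, y)
score_x2 = single_score(x2, y)
score_pair = pair_score(x1, x2, y)
null_pair = pair_score(x1, x2, y_random)
print(f"X1 alone: {score_x1:.3f}, X2 alone: {score_x2:.3f}")
print(f"pair: {score_pair:.3f}, pair vs random labels: {null_pair:.3f}")
```

In this toy setting each variable alone carries almost no information about Y, while the pair does, so a ranking based on individual scores would miss both. The pair score against random labels stays near 0, in line with the expectation stated above for variables independent of Y.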