Graph coloring for extracting discriminative genes in cancer data.

Mohamed A Mahfouz, Juan A Nepomuceno

doi:10.1111/ahg.12297

Open Access

Abstract

The main difficulty in analyzing gene expression data for microarray-based automated cancer diagnosis is the large number of genes (high dimensionality), many of them irrelevant (noise), compared with the very small number of samples. This study addresses that dimensionality-reduction challenge by introducing a technique termed the graph coloring approach (GCA) for microarray-based cancer classification. GCA analyzes the absolute correlation between gene-gene pairs and partitions genes into several hubs using graph coloring. It starts with a gene-selection step in which the top relevant genes are selected using a biserial correlation. At each step, a gene from the ordered list of top relevant genes is selected as the hub gene (representative) and its redundant genes are added to its group; the process is repeated recursively for the remaining genes. A gene is considered redundant if its absolute correlation with the hub gene is greater than a controlling threshold. A suitable range for the threshold is estimated by computing a percentage graph of the absolute correlation between gene-gene pairs. Each value in the estimated range can efficiently produce a new feature subset. GCA achieved a significant improvement over several existing techniques, with higher accuracy and fewer features. In addition, the genes selected by this method are relevant according to the information stored in scientific repositories. The proposed technique can help biologists accurately predict cancer in several areas of the body.
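
The hub-and-redundancy procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (rank_genes_by_relevance, group_by_hubs), the parameters top_k and threshold, and the toy data are all hypothetical, and the relevance score is the point-biserial correlation (Pearson correlation with a 0/1 class label), used here as a stand-in for the biserial correlation the abstract names. The threshold is passed in directly; estimating its range from the percentage graph is not shown.

```python
import numpy as np


def rank_genes_by_relevance(X, y, top_k):
    """Return indices of the top_k genes, most relevant first.

    X is (samples x genes); y is a 0/1 class label per sample.
    Relevance is |Pearson correlation| with y (point-biserial), an
    assumed stand-in for the biserial correlation in the abstract.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant genes get zero relevance
    relevance = np.abs(Xc.T @ yc) / denom
    return np.argsort(-relevance)[:top_k]


def group_by_hubs(X, ordered_genes, threshold):
    """Greedily partition the ordered genes into hub groups.

    The first unassigned gene becomes a hub; every remaining gene whose
    absolute correlation with the hub exceeds `threshold` is treated as
    redundant and joins that hub's group. Repeats until no genes remain.
    Returns {hub_gene_index: [redundant_gene_indices]}.
    """
    corr = np.abs(np.corrcoef(X[:, ordered_genes], rowvar=False))
    remaining = list(range(len(ordered_genes)))
    groups = {}
    while remaining:
        hub = remaining.pop(0)
        redundant = {g for g in remaining if corr[hub, g] > threshold}
        groups[int(ordered_genes[hub])] = [
            int(ordered_genes[g]) for g in remaining if g in redundant
        ]
        remaining = [g for g in remaining if g not in redundant]
    return groups


# Toy data (hypothetical): 6 samples, 3 genes; gene 1 is a linear copy of gene 0.
y = np.array([0, 0, 0, 1, 1, 1])
g0 = np.array([1.0, 2.0, 1.0, 5.0, 6.0, 5.0])
X = np.column_stack([g0, 2 * g0 + 1, [1.0, 5.0, 2.0, 4.0, 1.0, 6.0]])

ordered = rank_genes_by_relevance(X, y, top_k=3)
groups = group_by_hubs(X, ordered, threshold=0.9)
selected_features = sorted(groups)  # one representative (hub) per group
```

Under these assumptions, the hubs form the reduced feature subset, and rerunning group_by_hubs with a different threshold value yields a different subset, which mirrors the abstract's remark that each threshold in the estimated range produces a new feature subset.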

Full Text