Analysis of microarray gene expression data using information theory and stochastic algorithm

Narayan Behera

doi:10.1016/bs.host.2020.02.002

Abstract

Microarray gene expression data provide a simultaneous expression profile of thousands of genes for a given biological process. Generally, a few key genes among these thousands play dominant roles in a disease process, and finding them computationally is an important area of research in bioinformatics. A new computational approach is developed here to identify the candidate genes of a cancer process from microarray gene expression data. Gene clustering enables the identification of co-expressed genes that play pivotal roles in specified biological conditions. Many algorithms exist for extracting this information, but all have inherent limitations. The present model is a hybrid of a clustering algorithm and evolutionary computation. The evolutionary computation uses a genetic algorithm, which applies three biological principles of evolution (selection, recombination, and mutation) to solve an optimization problem. The interdependence measure between genes is based on mutual information. Many conventional algorithms instead use the Euclidean distance measure (differences of the gene expression values). Mutual information takes into account the similarity of gene expression levels as well as positive and negative correlations between genes while clustering them. The genes with the highest interdependence measures are the top candidate genes responsible for cancer. These top genes are believed to be faulty genes that contain the most diagnostic information for a diseased state. The analysis is carried out on gastric cancer, colon cancer, and brain cancer microarray gene expression datasets. Compared with many existing computational tools, the top candidate genes found by this evolutionary computational model classify the samples into cancerous and normal classes with higher accuracy.
The new model creates a more even distribution of genes across the clusters and provides better accuracy in picking the top candidate genes. Furthermore, the present computational tool is more coherent in clustering the genes across large gene expression numbers. This information-theoretic computational method can potentially be applied to the analysis of big data from other sources.
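
The contrast drawn in the abstract between a Euclidean distance measure and a mutual-information-based interdependence measure can be illustrated with a minimal sketch. This is not the author's implementation: the equal-width discretization into three bins, the function names, and the toy expression profiles are all assumptions made for illustration only.

```python
import math
from collections import Counter

def discretize(values, n_bins=3):
    """Map expression values to equal-width bin indices (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x, y, n_bins=3):
    """Mutual information (in bits) between two discretized profiles."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression profiles over six samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]  # negatively correlated with gene_a
gene_c = [3.0, 3.5, 3.0, 3.5, 3.0, 3.5]  # close in value, unrelated pattern

mi_ab = mutual_information(gene_a, gene_b)
mi_ac = mutual_information(gene_a, gene_c)
d_ab = euclidean(gene_a, gene_b)
d_ac = euclidean(gene_a, gene_c)

print(f"MI(a,b) = {mi_ab:.3f} bits, distance(a,b) = {d_ab:.3f}")
print(f"MI(a,c) = {mi_ac:.3f} bits, distance(a,c) = {d_ac:.3f}")
```

In this toy case the Euclidean distance places gene_a nearer to gene_c than to gene_b, while mutual information ranks the negatively correlated pair (gene_a, gene_b) as strongly interdependent and the pair (gene_a, gene_c) as independent. That is the kind of relationship a purely distance-based clustering would miss.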

Full Text