Gene shaving using a sensitivity analysis of kernel based machine learning approach, with applications to cancer data.

Md Ashad Alam, Hong-Wen Deng, Fokhrul Hossain, Md Ferdush Rahman, Mohammd Shahjaman, Enrique Hernandez-Lemus

doi:10.1371/journal.pone.0217027

Abstract

Background: Gene shaving (GS) is an essential and challenging tool for biomedical researchers because of the large number of genes in the human genome and the complex nature of biological networks. Most GS methods are not applicable to non-linear and multi-view data sets. While kernel based methods can overcome these problems, a well-founded positive definite kernel based GS method has yet to be proposed for biomedical data analysis.

Methods and findings: Since kernel based methods on genomic information can improve the prediction of diseases, we propose a novel method, “kernel based gene shaving”, which is based on the influence function of kernel canonical correlation analysis. To compare the performance of the proposed method with state-of-the-art methods in gene shaving, we analyzed extensive simulated and real microarray gene expression data sets. The performance metrics, including the true positive rate, true negative rate, false positive rate, false negative rate, misclassification error rate, false discovery rate and area under the curve, were computed for each method. In the colon cancer data analysis, the proposed method identified a significant subset of 210 genes out of 2000 genes and showed suggestive evidence of superior performance compared with the other methods. The proposed method can be applied to the study of other disease processes where the analysis of two-view data is a common task.

Conclusions: We addressed the challenge of finding unique kernel based GS methods by using the influence function of kernel canonical correlation analysis. The proposed method has been shown to perform better than state-of-the-art methods in gene shaving and has identified many more significant gene interactions, suggesting that genes function in a concerted effort in colon cancer. In similar biomedical data analysis, kernel based methods could be applied to select a potential subset of genes. Positive definite kernel based methods can overcome the non-linearity problem and improve the prediction process.
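The performance metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard confusion-matrix definitions, since this section does not give the paper's exact formulas. The function names `selection_metrics` and `auc` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def selection_metrics(truth: Sequence[bool], selected: Sequence[bool]) -> Dict[str, float]:
    """Confusion-matrix rates for a gene selection (standard definitions assumed).

    truth[i]    -- True if gene i is truly relevant (known only in simulation)
    selected[i] -- True if the selection method kept gene i
    """
    if len(truth) != len(selected):
        raise ValueError("truth and selected must have the same length")
    pairs = list(zip(truth, selected))
    tp = sum(1 for t, s in pairs if t and s)
    tn = sum(1 for t, s in pairs if not t and not s)
    fp = sum(1 for t, s in pairs if not t and s)
    fn = sum(1 for t, s in pairs if t and not s)

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a rate is undefined.
        return num / den if den else float("nan")

    return {
        "TPR": ratio(tp, tp + fn),                 # true positive rate
        "TNR": ratio(tn, tn + fp),                 # true negative rate
        "FPR": ratio(fp, fp + tn),                 # false positive rate
        "FNR": ratio(fn, fn + tp),                 # false negative rate
        "MER": ratio(fp + fn, tp + tn + fp + fn),  # misclassification error rate
        "FDR": ratio(fp, fp + tp),                 # false discovery rate
    }


def auc(truth: Sequence[bool], scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison; ties count as 1/2."""
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 10 hypothetical genes, the first 4 truly relevant.
truth = [True] * 4 + [False] * 6
selected = [True, True, True, False, True, False, False, False, False, False]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.1, 0.1, 0.05, 0.0]

metrics = selection_metrics(truth, selected)
metrics["AUC"] = auc(truth, scores)
```

In this toy case three of the four relevant genes are kept and one irrelevant gene is selected, so the true positive rate is 0.75 and the false discovery rate is 0.25. Such rates can be computed only when the truly relevant genes are known, as in simulated data.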

Highlights

Gene shaving (GS), which identifies significant subsets of genes, is an important research area in the analysis of DNA microarray gene expression data for biomedical discovery.
Since kernel based methods on genomic information can improve the prediction of diseases, we propose a novel method, “kernel based gene shaving”, which is based on the influence function of kernel canonical correlation analysis.
We addressed the challenge of finding unique kernel based GS methods by using the influence function of kernel canonical correlation analysis.

Introduction

Gene shaving (GS), which identifies significant subsets of genes, is an important research area in the analysis of DNA microarray gene expression data for biomedical discovery. GS methods aim to remove redundant and irrelevant genes so that supervised learning will be more accurate [1, 2]. This leads to the discovery of genes relevant to a particular target annotation and contributes to better medical diagnosis and prognosis. The genes selected by GS play an important role in gene expression data analysis since they can differentiate samples from different populations [3,4,5,6]. Despite their successes, these studies are often hampered by relatively low reproducibility and by difficulty in handling nonlinearity and multi-view data. While kernel based methods can overcome these problems, a well-founded positive definite kernel based GS method has yet to be proposed for biomedical data analysis.
