Multiple-kernel learning for genomic data mining and prediction

Christopher M Wilson,Kaiqiao Li,Xiaoqing Yu,Pei-Fen Kuan,Xuefeng Wang

doi:10.1186/s12859-019-2992-1


Open Access

https://doi.org/10.1186/s12859-019-2992-1


Journal: BMC Bioinformatics	Publication Date: Aug 15, 2019
Citations: 36	License type: open-access

Affiliation: Moffitt Cancer Center, Stony Brook University

Abstract

Background: Advances in medical technology have allowed for customized prognosis, diagnosis, and treatment regimens that utilize multiple heterogeneous data sources. Multiple kernel learning (MKL) is well suited for the integration of multiple high-throughput data sources. MKL remains under-utilized by genomic researchers, partly due to the lack of unified guidelines for its use and of benchmark genomic datasets.

Results: We provide three implementations of MKL in R. These methods are applied to simulated data to illustrate that MKL can select appropriate models. We also apply MKL to combine clinical information with miRNA gene expression data from an ovarian cancer study into a single analysis. Lastly, we show that MKL can identify gene sets that are known to play a role in the prognostic prediction of 15 cancer types using gene expression data from The Cancer Genome Atlas, as well as identify new gene sets for future research.

Conclusion: Multiple kernel learning coupled with modern optimization techniques provides a promising learning tool for building predictive models based on multi-source genomic data. MKL also provides an automated scheme for kernel prioritization and parameter tuning. The methods used in the paper are implemented as an R package called RMKL, which is freely available for download through CRAN at https://CRAN.R-project.org/package=RMKL.

Highlights

Advances in medical technology have allowed for customized prognosis, diagnosis, and treatment regimens that utilize multiple heterogeneous data sources
Multiple kernel learning (MKL) can construct non-linear classification without any parametric assumptions for a single or multiple data types
MKL may not suffer from overfitting because the final decision rule is based on a weighted average of support vector machine (SVM) models
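
As a rough illustration of the weighted-combination idea in the highlights above, the sketch below builds radial kernels at several bandwidths and mixes them into one kernel matrix with non-negative weights that sum to one. It is written in Python with the standard library only and does not use the RMKL package; the function names, the gamma values, and the weights are all hypothetical. In MKL the weights would be learned from the data rather than fixed by hand as they are here.

```python
import math

def rbf_kernel(X, gamma):
    """Gram matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-gamma * sq_dist)
    return K

def combine_kernels(kernels, weights):
    """Convex combination sum_m w_m * K_m; weights are rescaled to sum to 1."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    total = sum(weights)
    w = [x / total for x in weights]
    n = len(kernels[0])
    return [
        [sum(wm * Km[i][j] for wm, Km in zip(w, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Toy data: three samples with two features each (illustrative values only).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
gammas = [0.1, 1.0, 10.0]      # hypothetical candidate bandwidths
weights = [0.5, 0.5, 0.0]      # hypothetical weights; a zero weight drops that kernel

kernels = [rbf_kernel(X, g) for g in gammas]
K = combine_kernels(kernels, weights)
```

A kernel whose weight is zero contributes nothing to the combined matrix, which is why learned kernel weights can be read as a form of kernel selection.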

Summary

Results

Benchmark example: In addition to accuracy, an important characteristic of MKL is the learning of kernel weights. To avoid the curse of dimensionality, we include the 65 top-ranked genes, based on the p-value from testing for differences in mean expression between patients who survived more than 3 years and those who did not. We used these 65 genes to conduct SVM with 10-fold cross-validation for several radial kernels. DALMKL tends to be the most accurate. There are cases, such as ovarian cancer (OV), where SimpleMKL allocates weight more evenly across the gene sets and can achieve a significant increase in accuracy.

The pan-cancer pathway analysis revealed multiple gene sets that carry important prognostic value. Many pathways, such as KRAS signaling, inflammatory response, and spermatogenesis, had nonzero kernel-based importance scores across many cancer types. We hope these findings will spur additional research into the role of these pathways in cancer development and prognosis, especially spermatogenesis, which is less studied in cancer than other pathways.
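
The gene-filtering step described above (ranking genes by a two-group test and keeping the top-ranked ones) can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the summary does not say which test was used, so the sketch assumes a pooled-variance two-sample t-test. Under that assumption every gene has the same degrees of freedom, so sorting by |t| in descending order gives the same ranking as sorting by p-value in ascending order. The names `pooled_t` and `top_genes` and the toy expression values are hypothetical.

```python
import math

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two samples")
    ma, mb = sum(a) / na, sum(b) / nb
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (ma - mb) / se if se > 0 else 0.0

def top_genes(expr, labels, k):
    """Return the k genes with the largest |t| between label-1 and label-0 patients.

    expr:   dict mapping gene name -> list of expression values, one per patient
    labels: list of 0/1 group labels in the same patient order
    """
    scores = {}
    for gene, values in expr.items():
        group1 = [v for v, y in zip(values, labels) if y == 1]
        group0 = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(pooled_t(group1, group0))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example: 1 = survived more than 3 years, 0 = did not (made-up values).
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "geneA": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "geneC": [3.0, 3.5, 2.5, 2.0, 2.6, 1.6],
}
selected = top_genes(expr, labels, k=2)
```

In the setting described above, `k` would be 65 and the selected genes would then be passed to the SVM and MKL fits. Note that filtering on the full dataset before cross-validation can bias accuracy estimates upward; whether the paper filters inside each fold is not stated in this summary.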

Conclusion