MLkit: A machine-learning-powered automatic workflow for classification of cancer samples.

Qingyuan Li, Ji He

doi:10.1200/jco.2021.39.15_suppl.e13583

Abstract

e13583 Background: In the era of data explosion, precise classification of cancer samples based on multi-dimensional medical data provides more insight into disease mechanisms and useful hints for clinical treatment related to tissue of origin, recurrence tendency, and prognosis following chemotherapy or immunotherapy. We developed MLkit, an automatic workflow that selects features from large-scale multi-dimensional medical data and performs classification with a range of machine learning techniques.
Methods: MLkit is an automatic, one-stop workflow for classification of cancer samples with four modules: preprocessing (removal or imputation of missing data and feature standardization), feature selection (unsupervised multi-statistics and supervised selection using multiple machine learning estimators with recursive feature elimination and cross-validation), modeling (hyper-parameter tuning, performance evaluation and probability calibration) and prediction. Most current machine learning algorithms are implemented in the workflow, including linear models (logistic regression, ridge regression and stochastic gradient descent), ensemble models (gradient boosting, random forest, XGBoost, CatBoost, LightGBM and stacking), support vector machines (linear and non-linear kernels), naive Bayes, k-nearest neighbors and multi-layer perceptron neural networks. To evaluate the workflow, we used it to fit a model predicting tissue of origin from 450K DNA methylation data of 2,210 samples from lung, kidney and breast cancer patients collected in TCGA.
Results: MLkit performed well in predicting tissue of origin on independent validation sets of cancer patients, with stable feature selection, automatic hyper-parameter tuning and efficient probability calibration; the model achieved AUCs ranging from 0.85 to 0.96. We also applied the workflow to extensive real-world data, and most results showed superior accuracy and stable performance.
Conclusions: MLkit facilitates automated, one-stop classification of cancer samples using machine learning algorithms. It can be operated from a simple command line, making it accessible to a broad range of users. The strong performance of this workflow on multi-dimensional medical data can help improve the discovery of tumor biomarkers and optimize clinical follow-up and therapeutic treatment for cancer patients.
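
The four modules described in the Methods (preprocessing, feature selection, modeling and prediction) can be illustrated with a minimal sketch. The abstract does not describe how MLkit is implemented, so everything here is an assumption: scikit-learn is used as a stand-in, the function names `build_workflow` and `run_demo` are hypothetical, a variance filter stands in for the unsupervised statistics, logistic regression stands in for the full set of estimators, and a small synthetic three-class dataset replaces the TCGA methylation data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_workflow():
    """Hypothetical four-module workflow; not MLkit's actual code."""
    pipeline = Pipeline([
        # Module 1: preprocessing (imputation, then standardization).
        ("impute", SimpleImputer(strategy="median")),
        # Module 2a: unsupervised filter (drops constant features).
        ("variance", VarianceThreshold(threshold=0.0)),
        ("scale", StandardScaler()),
        # Module 2b: supervised recursive feature elimination with CV.
        ("select", RFECV(LogisticRegression(max_iter=1000),
                         step=0.2, cv=3, min_features_to_select=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Module 3: hyper-parameter search scored by one-vs-rest AUC ...
    search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]},
                          cv=3, scoring="roc_auc_ovr")
    # ... followed by probability calibration (Platt scaling).
    return CalibratedClassifierCV(search, method="sigmoid", cv=3)


def run_demo(seed=0):
    """Fit on synthetic data and evaluate on a held-out validation split."""
    X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                               n_classes=3, random_state=seed)
    # Inject about 2% missing values so the imputation step has work to do.
    rng = np.random.default_rng(seed)
    X[rng.random(X.shape) < 0.02] = np.nan
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    model = build_workflow().fit(X_train, y_train)
    # Module 4: prediction of calibrated class probabilities.
    proba = model.predict_proba(X_test)
    auc = roc_auc_score(y_test, proba, multi_class="ovr")
    return proba, auc


if __name__ == "__main__":
    proba, auc = run_demo()
    print(f"validation one-vs-rest AUC: {auc:.3f}")
```

Keeping imputation, scaling and feature selection inside the pipeline means they are refit on each training fold, so the validation AUC is not inflated by information leaking from held-out samples.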
