Guidance on modeling circulating miRNA to distinguish multiple cancer types by an observation of large-scale open data.

Jason Chia-Hsun Hsieh, Ko-Han Lee, Yu-Chuan Chang, Tsung-Ting Hsieh

doi:10.1200/jco.2023.41.16_suppl.e13537

Abstract

e13537 Background: Cell-free miRNAs (cf-miRNAs), which circulate in body fluids such as plasma or serum, have shown the ability to detect, diagnose, and monitor cancers. Combining machine learning (ML) with these biomarkers facilitates early detection of cancers, which can improve the accuracy of clinical decisions and empower people to take control of their health. However, cf-miRNA data have particular characteristics that affect ML results. This study therefore describes these characteristics from several perspectives and builds a reasonable model. Methods: We downloaded large-scale datasets generated on the GPL21263 platform from the Gene Expression Omnibus for modeling experiments. From several cf-miRNA-based cancer studies, we curated 8,174 subjects with 2,565 miRNA targets across 7 cancer types. We used principal component analysis (PCA) to explore the datasets, recursive feature elimination (RFE) for feature selection, and tree-based algorithms to build the prediction model. Results: The characteristics of cf-miRNA data that we would like to share are: (1) Cancer subjects express more cf-miRNAs than control subjects. In the control group, there were 294 and 327 miRNAs with missing rates under 50% and 25%, respectively. In contrast, there were 395 and 485 miRNAs under the same thresholds in the cancer group; (2) Dividing subjects into cancer and control groups is simpler than distinguishing specific cancer types. In the PCA space, the average Euclidean distance between the control group and each cancer type is 98.48, whereas it is 20.23 within each cancer type; (3) To obtain cancer-specific biomarkers, we suggest treating subjects with other, non-target cancers as negative controls. We modeled 7 cancer types and compared the proportion of cancer-specific biomarkers, that is, biomarkers not selected by any other model. 
The proportion increased from 30.0% to 57.8% after we added subjects with other, non-target cancers to the control group. Next, we focus on multi-cancer modeling: (4) At least 400 samples are needed to distinguish seven cancer types. In our experiment, we increased the size of the training set by one hundred samples at a time. Accuracy increased as data were added but plateaued after 400 samples; (5) Based on RFE, 120 miRNAs is a reasonable number for distinguishing seven cancers. Moreover, we found that some of these miRNAs are expressed only in cancer subjects. Such biomarkers might be lost if miRNAs were filtered out by missing rate; (6) The multi-cancer model achieves a 10-fold cross-validation accuracy of 93.0% using the gradient-boosted trees algorithm. Conclusions: In this study, we provide guidance for modeling miRNAs from several perspectives, including labeling strategy, sample and feature sizes, and the high-accuracy multi-cancer model that can be achieved. We hope this guidance will inspire researchers working on cf-miRNA-related machine learning applications.
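
The workflow described above (missing-rate counting, RFE-based feature selection, and a cross-validated gradient-boosted tree classifier) can be illustrated with a minimal sketch. This is not the authors' code: it assumes scikit-learn, uses small synthetic data in place of the GPL21263 datasets, and every name and setting in it (`count_detected`, `make_synthetic`, `build_pipeline`, the random-forest ranker inside RFE, the constant imputation value, 5 folds, 9 selected features) is a hypothetical choice made only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def count_detected(X, max_missing_rate):
    """Count features (columns) whose fraction of NaN entries is below the threshold."""
    missing_rate = np.isnan(X).mean(axis=0)
    return int((missing_rate < max_missing_rate).sum())


def make_synthetic(n_per_class=40, n_classes=3, n_features=30, seed=0):
    """Synthetic stand-in for an expression matrix (assumption, not real miRNA data)."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for label in range(n_classes):
        block = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
        # Shift three features per class so that each class has its own markers.
        block[:, label * 3:(label + 1) * 3] += 3.0
        X_parts.append(block)
        y_parts.append(np.full(n_per_class, label))
    X = np.vstack(X_parts)
    y = np.concatenate(y_parts)
    # Mask about 10% of entries as NaN to mimic undetected miRNAs.
    X[rng.random(X.shape) < 0.10] = np.nan
    return X, y


def build_pipeline(n_selected=9, seed=0):
    """Impute, select features with RFE, then classify with gradient-boosted trees.

    Keeping RFE inside the pipeline means features are re-selected within each
    training fold, so the held-out fold never influences the selection.
    """
    ranker = RandomForestClassifier(n_estimators=50, random_state=seed)
    return Pipeline([
        # Assumed handling of missing values: fill with a constant floor of 0.0.
        ("impute", SimpleImputer(strategy="constant", fill_value=0.0)),
        ("rfe", RFE(ranker, n_features_to_select=n_selected, step=3)),
        ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=seed)),
    ])


X, y = make_synthetic()
n_detected = count_detected(X, 0.25)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(build_pipeline(), X, y, cv=cv, scoring="accuracy")
mean_accuracy = float(scores.mean())
print(f"features with missing rate < 25%: {n_detected}")
print(f"mean cross-validated accuracy: {mean_accuracy:.3f}")
```

In this sketch, missing values are imputed rather than filtered out, which is one possible way to retain miRNAs that are detected only in a subset of subjects; whether that suits real cf-miRNA data is an open design choice, and the abstract does not state how missing values were handled.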
