Abstract

Introduction
Cancer type is determined through tumor morphology, aided by immunohistochemical staining. The development of machine learning (ML) models using histology slides has powered the image-based prediction of the site of origin in cancer of unknown primary (CUP). Here, we used ML on proteomic data to predict cancer types and tissue of origin from a sample cohort consisting of 1,277 human tissue samples spanning 44 cancer types. The training proteome datasets included two independent sets of proteomes acquired from a pan-cancer cell line collection and a subset of the tissue cohort for online ML.

Methods
All samples were processed using data-independent acquisition mass spectrometry (DIA-MS). Two proteomic profiles from the pan-cancer cell line cohort were generated using two independent sample preparation methods. These were normalized with ComBat and merged by averaging the protein abundance, yielding a single training set (D1) with 975 cell lines and 9,688 proteins. Similarly, 1,277 tissue samples were processed by DIA-MS, quantifying 9,501 proteins. Celligner was used to align the cell lines (D1) with the tissue cohort. Half of the tissue proteomes were used as a second training set (D2) for online ML, and the other half of the tissue cohort formed a hold-out test set (T1). A multinomial logistic regression was used to predict cancer and tissue types. Top-k accuracy, the evaluation metric, measures how often the correct cancer or tissue type class is among the k highest-ranked predicted classes.

Results
As a proof of concept, we defined six cancer types (adenocarcinoma, sarcoma, squamous carcinoma, lymphoma, melanoma and small cell carcinoma) and seven adenocarcinoma tissues of origin (breast, colorectal, liver, lung, ovary, stomach/esophagus and pancreas) for an ML experiment. We trained a classifier using the cell lines (D1) as the baseline training set, and then added D2 to D1 in successive 10% increments for online ML.
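The evaluation metric and the stepwise training schedule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `top_k_accuracy` and `stepwise_training_sets` are hypothetical, model fitting is omitted, and taking the first 10%, 20%, ... of D2 in list order is an assumption, since the abstract does not say how each increment was selected.

```python
from typing import Iterator, List, Sequence, Tuple


def top_k_accuracy(
    y_true: Sequence[int], probs: Sequence[Sequence[float]], k: int = 1
) -> float:
    """Fraction of samples whose true class is among the k most probable classes."""
    hits = 0
    for label, row in zip(y_true, probs):
        # Rank class indices by predicted probability, highest first.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(y_true)


def stepwise_training_sets(
    d1: List, d2: List, step: float = 0.10
) -> Iterator[Tuple[float, List]]:
    """Yield (fraction of D2 used, training set), starting from the D1-only baseline.

    Assumption: increments are taken as leading slices of d2; shuffle d2
    beforehand if a random order is wanted.
    """
    n_steps = round(1 / step)
    for i in range(n_steps + 1):
        n_added = round(len(d2) * i / n_steps)
        yield i / n_steps, d1 + d2[:n_added]


if __name__ == "__main__":
    # Toy predictions for three samples over three classes.
    y_true = [0, 2, 1]
    probs = [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4], [0.3, 0.6, 0.1]]
    print("top-1:", top_k_accuracy(y_true, probs, k=1))
    print("top-2:", top_k_accuracy(y_true, probs, k=2))

    # Toy stand-ins for D1 (cell lines) and D2 (tissue training half).
    d1 = ["cell_line_%d" % i for i in range(5)]
    d2 = ["tissue_%d" % i for i in range(20)]
    for fraction, train in stepwise_training_sets(d1, d2):
        print(f"{fraction:.0%} of D2 -> {len(train)} training samples")
```

In this sketch, each training set produced by the schedule would be used to refit the classifier, which is then scored on the fixed test set T1 with `top_k_accuracy`.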
We tested the baseline model and each subsequent model on the test set T1. Top-1 accuracy increased monotonically from 0.89 (baseline) to 0.97 (all of D2 used) when predicting the six cancer types. We observed an analogous trend when predicting the seven tissue types (from 0.64 to 0.84). These results suggest that cancer cell lines can be used to predict cancer type and adenocarcinoma tissue of origin.

Conclusion
Our proteomics-based ML model can predict cancer type and adenocarcinoma tissue of origin in concordance with existing histopathological classification. It can also assign a probability to each candidate tumor type and tissue of origin, potentially enabling the classification of CUP in future work. Adding tissue samples stepwise to the existing model further improves its predictive performance. This reflects a real-world knowledgebase whose predictive power will continue to increase as additional data are added.

Citation Format: Zhaoxiang Cai, Zainab Noor, Adel T. Aref, Emma L. Boys, Dylan Xavier, Natasha Lucas, Steven G. Williams, Jennifer M. Koh, Rebecca C. Poulos, Peter G. Hains, Phillip J. Robinson, Rosemary Balleine, Roger R. Reddel, Qing Zhong. Machine learning of cancer type and tissue of origin from proteomes of 1,277 human tissue samples and 975 cancer cell lines. [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 5391.