Abstract
The inability to locate the primary tumor site in patients with cancer of unknown primary (CUP) is a major obstacle to providing personalized therapeutic options and access to clinical trials. Despite the recent use of molecular-based tools to identify the tumor tissue of origin (TOO), overall survival for CUP patients remains low. Here, we present an AI-based tool that predicts the TOO by using genomic and transcriptomic data to classify CUP tumors into hierarchically organized molecular subgroups. The TOO predictor comprises DNA, RNA, and Consensus classifiers that are hierarchically organized with respect to molecular diagnosis: upper-level clusters are based on common molecular features reflecting a similar cell of origin, and lower levels contain specific diagnoses for further classification. The ML-based DNA classifier was trained on a dataset of publicly available genomic data generated from 8,000 samples and independently validated using more than 5,500 samples. The ML-based RNA classifier was trained on a dataset of publicly available transcriptomic data created from more than 10,100 samples, with tumor- and normal-specific features for each cancer type, and independently validated using 20,000 samples. The Consensus classifier, which combines the outputs of the DNA and RNA algorithms, was trained on a dataset of genomic and transcriptomic data from 1,000 samples and validated on an independent dataset of 2,000 samples. For each classifier, the features and the best hyperparameters for the final model were selected by data analysis according to the weighted F1-score. The 3-classifier algorithm predicts the TOO for 33 cancer types and subtypes belonging to solid neoplasms, independently of sample source, sequencing methods, and cohort.
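As an illustration of the selection metric named above, the weighted F1-score can be computed as the support-weighted mean of per-class F1 scores. The following is a minimal sketch in plain Python under that standard definition; the function name `weighted_f1` and the toy labels are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative sketch)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # True positives: samples of this class predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        # False positives: samples of other classes predicted as this class.
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        # False negatives: samples of this class predicted as something else.
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += f1 * n / total
    return score


# Hypothetical toy labels, for illustration only.
y_true = ["lung", "lung", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "colon"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by class support keeps common diagnoses from being under-counted when cancer types are unevenly represented, which is one plausible reason to prefer it over an unweighted mean in a multi-class setting like this one.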
Validation of the Consensus classifier showed a higher accuracy (95% F1-score) than the DNA and RNA classifiers (79% and 93% F1-scores, respectively), along with high sensitivity (95%), specificity (99%), and precision (96%), as it takes into account both genomic events and expression patterns. The TOO predictor was prospectively validated on approximately 298 clinical samples with a known diagnosis using all classifiers. The diagnosis was identified in 90% of clinical cases (295/297) by the Consensus classifier with 90% sensitivity and 94% precision. The call rate for the DNA and RNA classifiers was above 95%. Of note, the sensitivity of the top 4 predicted diagnoses was > 90% for all 3 classifiers, and the calculated rule-out accuracy of the Consensus classifier was 97%. In conclusion, an ML-based algorithm was developed that uses genomic and transcriptomic data to accurately predict the TOO for CUP tumors. Applying the Consensus classifier after the DNA and RNA classifiers helps to identify the TOO of the tumor with high specificity, which can guide precision oncology therapeutic options.
Citation Format: Zoia Antysheva, Daria Kiriy, Anton Sivkov, Alexander Sarachackov, Alexandra Boyko, Naira Samarina, Nara Shin, Jessica H. Brown, Ivan Kozlov, Viktor Svekolkin, Alexander Bagaev, Nathan Fowler, Nikita Kotlov. An ML-based tool for predicting tissue of origin for cancer of unknown primary (CUP) based on genomic and transcriptomic data. [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 5405.
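The top-4 sensitivity reported in the abstract can be read as the fraction of samples whose known diagnosis appears among the four highest-scoring predictions. The sketch below assumes that reading and assumes each classifier emits one score per diagnosis; the function name `top_k_sensitivity`, the data layout, and the toy scores are hypothetical, and the abstract does not state how the authors computed the figure.

```python
def top_k_sensitivity(y_true, scores, k=4):
    """Fraction of samples whose true label is among the k top-scoring labels.

    y_true: list of true labels.
    scores: list of dicts mapping label -> predicted score, one per sample.
    """
    hits = 0
    for truth, sample_scores in zip(y_true, scores):
        # Rank candidate diagnoses from highest to lowest score.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# Hypothetical toy scores, for illustration only.
y_true = ["breast", "breast"]
scores = [
    {"lung": 0.6, "breast": 0.3, "colon": 0.1},
    {"lung": 0.5, "breast": 0.1, "colon": 0.4},
]
print(top_k_sensitivity(y_true, scores, k=2))
```

With `k=1` this reduces to ordinary top-1 accuracy, so reporting a larger `k` shows how often the correct diagnosis is at least on the shortlist.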