Abstract

Background: Predicting tissue of origin (ToO) using clinical and molecular data improves diagnostic accuracy up to 95% in patients with Cancer of Unknown Primary (CUP). It is hypothesized that better treatment stratification of CUP patients using omics and machine learning (ML) classifiers may improve prognosis. Methods: We used publicly available whole exome somatic mutation data from 4733 primary solid tissue samples, across 11 tumor types from the TCGA database, and employed an ML classifier to predict their ToO. We used 5 sets of modeling features: 1) non-silent somatic mutation burden of 230 cancer-related genes; 2) frequency of SNP substitution type; 3) trinucleotide mutation frequency; 4) copy number variation of the 230 cancer-related genes; 5) presence of hotspot mutations. We trained a Support Vector Machine on a training subset (80% of samples) and tuned the hyperparameters to maximize the 5-fold cross-validation F1-score. We then tested the model performance on a validation subset (20% of samples) and on a limited (n=6) dataset of metastatic samples present in the TCGA database. Results: On the primary tumor validation set, we achieved an average AUC of 0.98 (std: 0.02) and top 1, top 2 and top 3 accuracies of 80% (std: 0.11), 90% (std: 0.08) and 95% (std: 0.04) respectively, across 11 tumor types. The classification accuracy plateaus after ~300 samples, suggesting that further data collection may benefit low-performing tumor types. The 2 worst performers, esophageal and stomach cancers, were mostly misclassified as colorectal cancers, reflecting their relative similarity. On metastatic samples (n=6), the model achieved 67% accuracy; this part of the work is still in progress. Conclusion: Our study confirms the potential for a DNA-based machine learning approach to improve prognosis in CUP patients by aiding diagnosis of ToO.
To this end, we plan to take this study further by applying this approach to large, independent datasets derived from metastatic samples and liquid biopsies from CUP patient cohorts.

Table 1.

Tumor type   Top_1_acc  Top_2_acc  Top_3_acc  Precision  Recall  F1_score  Training_size
Breast       0.88       0.97       0.99       0.88       0.84    0.86      756
Colorectal   0.85       0.97       0.98       0.83       0.83    0.83      460
Oesophagus   0.50       0.75       0.89       0.51       0.50    0.51      144
Liver        0.83       0.92       0.96       0.82       0.89    0.85      288
Lung         0.86       0.91       0.94       0.98       0.81    0.88      396
Ovary        0.87       0.94       0.97       0.77       0.90    0.83      312
Pancreas     0.73       0.79       0.85       0.62       0.70    0.66      132
Prostate     0.88       0.94       0.98       0.75       0.88    0.81      380
Sarcoma      0.78       0.84       0.96       0.80       0.82    0.81      180
Stomach      0.73       0.91       0.95       0.68       0.58    0.62      340
Endometrial  0.90       0.97       0.99       0.86       0.87    0.87      400
Mean         80.09%     90.09%     95.09%     0.77       0.78    0.78      344

Tumor type             Top_1_acc_n  Top_2_acc_n  Top_3_acc_n  n samples
Metastatic breast      2            2            2            2
Metastatic prostate    0            0            0            1
Metastatic pancreas    0            0            0            1
Metastatic sarcoma     1            1            1            1
Metastatic oesophagus  1            1            1            1
Mean accuracy          67%          67%          67%

Citation Format: Andrea Giorni, Prabu Sivasubramiam, Aidan Kubeyev, Jordan Laurie, Luiz Silva, Matthew Foster, Uzma Asghar, Matthew Griffiths. Using machine learning to predict tissue of origin from somatic mutation features. [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 5429.
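
The top 1, top 2 and top 3 accuracies reported in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `top_k_accuracy` and `pooled_accuracy` are hypothetical, and the per-sample class scores passed to `top_k_accuracy` stand in for whatever scores the classifier produces. The metastatic counts are copied from Table 1. Under the assumption that the 67% figure pools all six metastatic samples (rather than averaging per-tumor-type accuracies), the second function reproduces it as 4 correct out of 6.

```python
from typing import Dict, Mapping, Sequence, Tuple


def top_k_accuracy(
    scores: Sequence[Mapping[str, float]],
    labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    `scores` holds one mapping per sample (class name -> classifier score);
    `labels` holds the true class of each sample, in the same order.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    if k < 1:
        raise ValueError("k must be a positive integer")

    hits = 0
    for sample_scores, true_label in zip(scores, labels):
        # Rank class names from highest to lowest score.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Metastatic rows of Table 1: tumor type -> (top-1 correct, n samples).
METASTATIC_TOP1: Dict[str, Tuple[int, int]] = {
    "breast": (2, 2),
    "prostate": (0, 1),
    "pancreas": (0, 1),
    "sarcoma": (1, 1),
    "oesophagus": (1, 1),
}


def pooled_accuracy(counts: Mapping[str, Tuple[int, int]]) -> float:
    """Accuracy over all samples pooled together (assumed aggregation)."""
    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return correct / total
```

With these counts, `pooled_accuracy(METASTATIC_TOP1)` evaluates to 4/6, which rounds to the 67% shown in the table. An unweighted mean of the five per-type accuracies would give 60% instead, which is why the pooled reading is assumed here.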
