Abstract

While progression to metastatic disease is the main cause of cancer death, little is known about the genomic mechanisms that drive metastasis. Rapidly growing clinical genomic data sets have the potential to identify genomic biomarkers of cancer metastasis; however, manual curation of clinical data is quickly emerging as a bottleneck. To overcome this challenge, we have developed a natural language processing (NLP) pipeline to identify organs affected by metastasis from radiology reports of patients with cancer. To develop our NLP models, we leveraged the AACR GENIE Biopharma Collaborative lung and colorectal cancer datasets generated in part at Memorial Sloan Kettering Cancer Center (MSK), containing curated labels of ten metastatic disease sites derived from 31,445 corresponding free-text radiology reports (2,310 patients). Using these data, we trained three machine learning models for identifying metastatic events from clinical text: logistic regression, a convolutional neural network (CNN), and Bidirectional Encoder Representations from Transformers (BERT). We split patients into a training set (80% of patients) and a validation set (20%). The BERT model yielded superior performance across evaluation metrics, with an area under the receiver operating characteristic curve (AUC) of 0.981 averaged across metastatic disease sites, an average accuracy of 97.3%, macro-average precision/recall of 85.1/85.6, and micro-average precision/recall of 87.5/89.6. We applied our method to radiology reports from 52,000 patients in the MSK-IMPACT clinical sequencing cohort, whose tumors were prospectively profiled. A comparison with the MSK-MET cohort, which contains metastatic events derived from billing codes in a subset of 25,000 patients, showed strong concordance (79.7% of metastatic events matched). The NLP-based method identified an average of 1.4 additional metastatic sites per patient, an expected result given the incomplete nature of the billing code data.
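To make the distinction between the macro- and micro-average precision/recall figures concrete, the following is a minimal sketch of how the two averages differ for a multi-label task such as per-site metastasis detection. It is not the authors' evaluation code: the function names (`precision_recall`, `macro_micro`) and the toy labels are illustrative assumptions, and each report is assumed to be encoded as a list of 0/1 flags, one per metastatic site.

```python
from typing import Dict, List, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def macro_micro(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Dict[str, Tuple[float, float]]:
    """Macro- and micro-averaged (precision, recall) over all sites.

    Each row is one report; each column is one metastatic site (hypothetical
    encoding: 1 = site affected, 0 = not affected).
    """
    n_sites = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per site
    for s in range(n_sites):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[s] == 1 and p[s] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[s] == 0 and p[s] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[s] == 1 and p[s] == 0)
        counts.append((tp, fp, fn))

    # Macro: compute the metric per site, then average, so every site counts equally.
    per_site = [precision_recall(*c) for c in counts]
    macro = (
        sum(p for p, _ in per_site) / n_sites,
        sum(r for _, r in per_site) / n_sites,
    )

    # Micro: pool the counts over all sites first, so frequent sites dominate.
    micro = precision_recall(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return {"macro": macro, "micro": micro}


if __name__ == "__main__":
    # Toy data: four reports, two sites. Site 0 is common and predicted well;
    # site 1 is rare and predicted poorly.
    truth = [[1, 0], [1, 0], [1, 1], [0, 0]]
    preds = [[1, 0], [1, 1], [1, 0], [0, 0]]
    print(macro_micro(truth, preds))
```

In this toy case the rare, poorly predicted site pulls the macro average down more than the micro average, which is the usual reason micro-averaged scores exceed macro-averaged ones when some sites are uncommon.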
Analyzing genomic and clinical data in this cohort, we confirmed that chromosomal instability, as inferred by the fraction of genome altered (FGA), is strongly correlated with metastatic burden (defined as the number of distinct organs affected by metastases) in several tumor types, including prostate adenocarcinoma, lung adenocarcinoma, and HR-positive breast ductal carcinoma. We also identified this trend in 10 additional cancer types in which it had not previously been reported, including lobular HR-positive breast carcinoma and esophageal adenocarcinoma. We demonstrate that mining of electronic health records can be used to extract rich, structured clinical information. Our models, applied at scale, offer a unique resource for investigating the biological basis of metastatic spread. We hope our automated clinical data extractions can enable further large-scale studies of associations between genomic biomarkers and metastatic behavior.

Citation Format: Anisha Luthra, Karl Pichotta, Brooke Mastrogiacomo, Samantha McCarthy, Steven Maron, Jianjiong Gao, Justin Jee, Christopher J. Fong, Nikolaus Schultz. A.I.-assisted clinical data curation to determine genomic biomarkers of cancer metastasis [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1158.