Abstract

Genetic risk scores predict future outcomes based on germline genetics. Human germline genetics are usually characterized by millions of SNPs. Polygenic risk scores estimate risk from a linear combination of these SNPs, but non-linear combinations may also contribute. Modern machine learning algorithms excel at identifying differences in these non-linear combinations between two groups. However, these algorithms require many more subjects than features (SNPs). Thus, they cannot be used to recognize genetic differences between two populations unless one has millions of subjects. We developed a representation of the human genome that requires only dozens of numbers, each representing a measure of the length of a chromosome. We used this representation and machine learning methods to test two hypotheses related to breast cancer: whether any distinguishable genetic differences exist between (1) women who develop breast cancer and women who do not and (2) women who have a recurrent breast tumor and those who do not. To test our hypotheses, we used data from the UK Biobank, for recurrence, and the NIH All of Us dataset, for occurrence. We computed a set of numbers representing the chromosome-scale length variation from germline DNA for each woman in the dataset. For each test we constructed a dataset that consisted of women with the desired trait (TRUE) and an equal number of age-matched participants without the trait (FALSE). We used the H2O AI platform in conjunction with the R statistical computing environment to train and test machine learning models to distinguish between the two classes in each dataset. To quantify the performance of the developed models, we calculated evaluation metrics of the best ML model on an unseen test dataset. We compared the AUC of these models to a control for each dataset, in which we randomly scrambled the TRUE/FALSE labels. The UK Biobank has 488,377 patients with genetic data. Of these, 13,968 patients have been diagnosed with breast cancer at least once during the study period. Among them, 489 patients have had a breast cancer recurrence. Our machine learning model assigns a score to each patient based on their germline genetics. We found that patients whose score ranks in the highest quintile are approximately 2.36 times as likely to have a breast cancer recurrence as those in the lowest quintile. For the occurrence model, we assessed All of Us genetic data, which consists of 98,600 participants with whole genome sequencing data. Of these, 8,260 female participants have been diagnosed with malignant neoplasm of the breast. This work lays the basis for future investigations on big genetic data. We found a small genetic difference between women who have recurrent breast tumors and those who do not.

Citation Format: Yasaman Fatapour, James Brody. Using chromosomal-scale length variation to predict breast cancer occurrence and recurrence with machine learning [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 772.
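
The two evaluation ideas in the abstract, comparing a model's AUC against a scrambled-label control and comparing the highest and lowest score quintiles, can be illustrated with a minimal sketch. It is written in plain Python and is not the authors' H2O/R pipeline. The function names (`auc`, `scrambled_auc`, `quintile_rate_ratio`) and the synthetic data are hypothetical. For brevity the control here shuffles the labels at evaluation time only; the abstract does not say whether the models were retrained on the scrambled labels.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels are 1 (TRUE) or 0 (FALSE); tied scores share their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def scrambled_auc(scores, labels, rng):
    """Control: AUC of the same scores against randomly shuffled labels."""
    shuffled = list(labels)  # copy so the caller's labels stay untouched
    rng.shuffle(shuffled)
    return auc(scores, shuffled)


def quintile_rate_ratio(scores, labels):
    """Rate of TRUE in the highest-score fifth divided by the lowest fifth."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: p[0])]
    k = len(ranked) // 5
    low_rate = sum(ranked[:k]) / k
    high_rate = sum(ranked[-k:]) / k
    if low_rate == 0:
        return float("inf")  # no TRUE cases in the lowest fifth
    return high_rate / low_rate


if __name__ == "__main__":
    # Synthetic, balanced toy data (assumption): TRUE cases score slightly higher.
    rng = random.Random(0)
    labels = [1] * 500 + [0] * 500
    scores = [rng.gauss(0.2 * y, 1.0) for y in labels]
    print("model AUC:     ", round(auc(scores, labels), 3))
    print("scrambled AUC: ", round(scrambled_auc(scores, labels, rng), 3))
    print("quintile ratio:", round(quintile_rate_ratio(scores, labels), 2))
```

With informative scores the first AUC should sit above 0.5, while the scrambled control should hover near 0.5. The gap between the two is what indicates a real signal.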

