Background: Thrombosis is a leading cause of early morbidity and mortality in patients (pts) with polycythemia vera (PV) [1]. Thrombosis risk informs treatment decisions in PV, but current risk stratification relies on two non-modifiable factors, age (< 60 vs ≥ 60 years) and thrombosis history (hx) [2], and therefore offers no way to monitor changes in risk. We used a rich dataset and machine learning (ML) methods to identify clinically available data useful for dynamic prediction of individualized thrombosis risk in PV.

Methods: Under Weill Cornell Medicine (WCM) institutional review board approval, we identified in our research data repository (RDR) 470 eligible pts with a PV diagnosis (Dx) and annotated thrombosis outcomes, as previously described [3]. We queried electronic medical records using our RDR, Observational Medical Outcomes Partnership, Structured Query Language, and novel workflows implemented in R and Python to automate extraction of all available clinical data for pts at every clinical visit. Over 2100 clinical parameters were available for ~8400 hematology clinic visits, yielding ~1.4 million data elements. These data included all laboratory measures, molecular studies, pathology reports, vitals, medications and clinical hx. Data completeness varied across data types, with 283 parameters available in >90% of visits for all pts. In total, 292 parameters with over 50% availability were assessed; the others were excluded from analysis. Multivariate imputation by chained equations (MICE) was used to impute missing data. To address potential artifacts from imputation, all analyses were performed on 10 separately imputed datasets and the results were aggregated. AutoGluon infrastructure was used to identify the best-performing ML algorithms. We then used our dataset to train random forest (RF) ML models that classify, at each clinic visit, the likelihood of thrombosis within the following year.
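As an illustration of the per-visit framing described above, the following minimal sketch shows one way visit-level records could be given a binary outcome (thrombosis within a year of the visit) together with two time-based features. The function name, record layout, 365-day horizon, and example dates are assumptions made for illustration; the abstract does not describe the study's actual implementation.

```python
from datetime import date, timedelta


def build_visit_rows(dx_date, visit_dates, thrombosis_dates, horizon_days=365):
    """Return one dict per clinic visit with time features and a binary label.

    Hypothetical helper: the field names and the 365-day horizon are
    illustrative assumptions, not the study's code.
    """
    events = sorted(thrombosis_dates)
    horizon = timedelta(days=horizon_days)
    rows = []
    for visit in sorted(visit_dates):
        # Events on or before the visit count as history, not as outcomes.
        prior = [e for e in events if e <= visit]
        # Outcome: any event strictly after the visit and within the horizon.
        upcoming = [e for e in events if visit < e <= visit + horizon]
        rows.append({
            "visit": visit,
            "years_since_dx": (visit - dx_date).days / 365.25,
            "years_since_prior_thrombosis": (
                (visit - prior[-1]).days / 365.25 if prior else None
            ),
            "thrombosis_within_horizon": int(bool(upcoming)),
        })
    return rows


# Made-up example patient: one thrombosis about 14 months after diagnosis.
rows = build_visit_rows(
    dx_date=date(2015, 1, 1),
    visit_dates=[date(2015, 6, 1), date(2016, 6, 1), date(2018, 6, 1)],
    thrombosis_dates=[date(2016, 3, 1)],
)
for row in rows:
    print(row)
```

In this made-up example only the first visit is labeled positive, because the event falls within a year after it; at later visits the same event instead contributes to the time-since-prior-thrombosis feature.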
To identify a practical clinical model composed of a few core parameters (e.g. 3-7), we ranked parameters by Gini feature importance and compared the performance of >2×10⁵ such models (every combination of the top 20 parameters in groups of 3-7) with that of the full model. Model performance was evaluated using the F1 score, which is considered highly predictive when >0.8.

Results: Of 470 PV pts, diagnosed at a median age of 54 years (range 20-94), 64 (14%) had a thrombosis hx at Dx (8.1% venous, 6.6% arterial). Over a median follow-up of 10 years (yr), 159 thromboses occurred in 115 pts (88 venous, 71 arterial). Cumulative incidence of thrombosis was non-linear, as previously reported [1]. The annual incidence rate (IR) of thrombosis was higher shortly after Dx (IR of 4.4% vs 1%, 2-yr cutoff) and following a thrombotic event (IR of 9.7% vs 1.8%, 2-yr cutoff). Consistent with this observation, ML models identified time from Dx and time from prior thrombosis among the top 20 most predictive parameters as assessed by Gini feature importance (Figure 1). The most predictive parameters also included risk factors previously implicated in PV thrombosis (age, blood counts, JAK2 allele burden), known thrombosis risk factors that are under-recognized in PV (blood type [4], BMI, creatinine), and parameters not previously thought to contribute to thrombosis risk (MCV, uric acid, LDH). The full ML model was highly predictive of thrombosis, with an F1 of 0.82 (Figure 2). We then identified simplified models (combinations of 4-5 variables) that performed similarly to the full model, with an F1 > 0.8. One example combines age, time since Dx, time since prior thrombosis, and BMI. These models are being validated against external datasets, and an online risk calculator will be developed.

Discussion: We used ML and deep datasets for unbiased identification of the clinical parameters most predictive of near-term thrombosis risk in PV pts.
These models capture dynamic changes in patient risk arising from intrinsic factors (e.g. age, blood type), life events (e.g. time since Dx/thrombosis) and clinical changes (e.g. ANC, BMI), and can be used to tailor risk mitigation strategies. Validation against external datasets of models that include at least the core variables of age, time since Dx, time since prior thrombosis, and BMI would likely establish a universal model to dynamically predict individualized risk of thrombosis in PV pts.

References
1. Hultcrantz, et al. Ann Intern Med. 2018
2. Barbui T, et al. Leukemia. 2018
3. Abu-Zeinah G, et al. Leukemia. 2021
4. Groot H, et al. Arterioscler Thromb Vasc Biol. 2020