Introduction: Assessment and reduction of thrombotic risk are fundamental to the management of patients with essential thrombocythaemia (ET). We present a machine learning approach that summarises patient electronic health records (EHR) to determine the prevalence of cardiovascular comorbidities and risk factors. We then review the use of the QRISK-3 score to assess cardiovascular risk. Methods: We used a natural language processing (NLP) pipeline to identify mentions of hypertension (HTN), hypercholesterolaemia (HC), diabetes mellitus (DM), smoking (SM), unspecified thrombosis (VTE), deep vein thrombosis (DVT), pulmonary embolism (PE), portal vein thrombosis (PVT), myocardial infarction (MI) and stroke (CVA) in EHR. CogStack is an information retrieval and extraction architecture incorporating structured and unstructured EHR components. Data extracted from CogStack were processed by the Medical Concept Annotation Toolkit (MedCAT), which was used to disambiguate and capture synonyms and acronyms for Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) concepts. Using deep learning, MedCAT determined linguistic and grammatical context such as negation, subject and temporality. The base MedCAT model was trained in an unsupervised manner on >18 million EHR documents and then fine-tuned on 500 clinician-annotated haematology documents. MedCAT mapped mentions of relevant concepts to their respective SNOMED-CT codes, and total counts were aggregated and grouped by individual patient. Manual validation was performed, and an optimiser converted counts to a binary state by applying a threshold above which a patient's condition was inferred to be present (Fig. i). QRISK-3 is a validated score incorporating age, ethnicity, body mass index and other cardiovascular risk factors to determine 10-year cardiovascular risk in people aged 25-84. Results: 12,905 documents from 560 ET patients were reviewed (median 20 per patient, IQR 8-34).
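The count-to-binary step described in the Methods can be illustrated with a minimal sketch. The abstract does not specify how the optimiser works, so selecting the threshold that maximises F1 against clinician labels is an assumption here, and the patient identifiers, counts and labels below are hypothetical.

```python
from collections import Counter


def f1_score(predicted, actual):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if that denominator is zero."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(counts, labels, candidates):
    """Return (threshold, f1) maximising F1, where a patient is positive
    when their mention count is strictly greater than the threshold.
    Ties go to the lowest threshold. This selection rule is an assumption,
    not the optimiser used in the study."""
    patients = sorted(labels)
    actual = [labels[p] for p in patients]
    best = None
    for t in sorted(candidates):
        predicted = [counts.get(p, 0) > t for p in patients]
        score = f1_score(predicted, actual)
        if best is None or score > best[1]:
            best = (t, score)
    return best


# Hypothetical data: one entry per non-negated mention of a single concept.
mentions = ["p1"] * 5 + ["p3"] + ["p4"] * 3
mention_counts = Counter(mentions)  # aggregate mentions per patient

# Hypothetical clinician labels from a manual validation set.
clinician_labels = {"p1": True, "p2": False, "p3": False, "p4": True}

threshold, score = best_threshold(mention_counts, clinician_labels, range(0, 4))
print(f"chosen threshold: >{threshold} mentions, F1 = {score:.2f}")
```

Requiring more than one mention guards against a single stray or historical reference flagging a patient as positive, at the cost of missing conditions that are documented only once.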
In the manual validation dataset (n=120), MedCAT achieved excellent real-world F1 scores (the harmonic mean of precision and recall) for most concepts (HTN 0.91, HC 0.81, DM 1.0, VTE 0.73, CVA 0.87 and MI 0.67). Using a threshold of >2 mentions to define a positive population, HTN was identified in 21.3% (119) of patients, DM in 4.6% (26), MI in 3.6% (20), CVA in 7.7% (43), VTE in 8% (45), DVT in 1.4% (8), PE in 1.8% (10), PVT in 1.3% (7) and positive smoking status in 6.6% (37). HC was identified in 9.6% (54) using a threshold of >1 mention. Of patients with HC, 52% (28) also had HTN, as did 69.2% (18) of those with DM. Obesity was not identified in any patients using this approach. Patients with a diagnosis of HTN were more likely to have CVA than those without (15:104 vs 28:413, p=0.03). Patients with HTN were also more likely to have VTE (13:106 vs 19:422, p=0.01). Of patients with CVA and MI, 58.1% (25) and 55% (11), respectively, had the event before or at diagnosis, and 30.2% (13) and 10% (2) had it while receiving cytoreductive therapy. QRISK-3 analysis was performed first in 32 patients with prior thrombosis and baseline criteria, to evaluate predictive value, and then in 137 patients classified as low or intermediate (LIM) risk who were not receiving cytoreductive therapy. Mean QRISK-3 was 8 in the thrombosis group, validating its relevance, compared with 2.5 in the LIM cohort (p<0.0001, Fig. ii). Using the recognised QRISK-3 score threshold of >7.5 to define a high-risk population, 5.1% (7) of patients from the LIM group were reclassified as high-risk due to additional comorbidities relevant to QRISK-3, including HTN in 8% (11), migraine in 7.3% (10), DM in 2.2% (3), severe mental illness in 2.9% (4) and antipsychotic medication in 0.7% (1). Discussion: We describe a novel approach to cardiovascular risk assessment in patients with ET that incorporates machine learning, allowing large-volume data analysis, and detailed risk assessment using QRISK-3 scoring.
We provide a rare ‘real-world’ report on the prevalence of comorbidities in this group, confirming increased CVA and VTE in patients with HTN. A previous report of 891 patients with ET showed prevalence of 5% for CVA, 2% for MI and 4% for VTE, suggesting that the detection rate using our approach is within acceptable limits (Carobbio et al., Blood, 2011). Finally, as a novel finding, we show that QRISK-3 scoring is predictive of increased thrombotic risk and identifies a small group of patients, not detected using standard approaches, who should be considered high-risk and may benefit from cytoreductive therapy.
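The association between HTN and CVA reported in the Results can be explored from the published counts with a minimal sketch. The abstract does not name the statistical test used, so the choice of a two-sided Fisher's exact test is an assumption, as is reading 15:104 vs 28:413 as patients with:without CVA in the HTN and non-HTN groups. The resulting p-value need not match the reported one exactly.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised probability of a table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(row1 + row2, col1)


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Counts from the Results, under the reading stated above:
# HTN: 15 with CVA, 104 without; no HTN: 28 with CVA, 413 without.
p_value = fisher_exact_two_sided(15, 104, 28, 413)
or_cva = odds_ratio(15, 104, 28, 413)
print(f"odds ratio = {or_cva:.2f}, two-sided p = {p_value:.3f}")
```

The exact test is used here only because it needs no external library and makes no large-sample approximation; a chi-squared test on the same table is an equally plausible choice.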