Abstract

Background
Aortic stenosis (AS) is one of the most common heart valve diseases in the Western world, with a prevalence of approximately 4% in patients over 70 years old. High-quality observational data can provide insight into the characteristics that define patient trajectories and inform the design of appropriately powered randomised trials.

Objective
The aim of this study was to develop a data pipeline that generates a large database describing a hospital AS cohort by analysing both structured and unstructured (free-text) data in routinely collected electronic health records (EHR). The unstructured data were processed using natural language processing (NLP), an artificial intelligence technique that allows machines to recognise clinical concepts in EHR text, interpret them in context, and link them to the SNOMED ontology.

Methods
The open-source CogStack retrieval system was used to collect the clinical notes of AS patients. The NLP toolkit MedCAT was used to identify clinical concepts, in the form of SNOMED terms, in those notes. In addition to patient demographic data (including ethnicity and markers of social deprivation), a range of clinical characteristics, co-morbidities, medications and procedural interventions were extracted. The dataset was validated against 3 existing, independent data sources: structured echocardiogram reports, surgical aortic valve replacement (AVR) operation notes, and transcatheter aortic valve implantation (TAVI) reports.

Results
We identified 7,451 patients with AS from our EHR, of whom 5,754 had an echocardiogram at our centre. The mean duration of follow-up was 5 years. Of those with an echocardiogram at our centre, we were able to extract data on AS severity in 91% and on left ventricular function in 89%. Manual validation of procedural data suggests a sensitivity of 98% for detecting TAVI and 97% for detecting AVR procedures. Kaplan-Meier curves based on automatically detected patient mortality were consistent with known survival data for patients with AS.

Conclusion
We developed and validated an automated, NLP-enabled data pipeline that identifies and characterises a retrospective single-centre cohort of AS patients from routinely collected EHR, across the spectrum of severity and with long-term mortality data. This resource has the potential to provide insight into several outstanding questions about the progression of AS and the influence of ethnicity and social deprivation, without the need for conventional dedicated databases. The EHR-based database also has the potential to guide the design of future randomised controlled trials.

Data Pipeline
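The sensitivity figures reported for procedure detection can be illustrated with a minimal sketch. It assumes that sensitivity is computed as the fraction of manually confirmed procedures that the pipeline also detected. The function name `sensitivity` and the patient identifiers are hypothetical and are not taken from the study.

```python
def sensitivity(detected_ids, reference_ids):
    """Fraction of reference (manually confirmed) cases that were also detected."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set is empty")
    # True positives: cases present in both the reference and the pipeline output
    true_positives = len(reference & set(detected_ids))
    return true_positives / len(reference)


# Toy identifiers for illustration only; these are not study data.
reference_tavi = {"p01", "p02", "p03", "p04"}   # confirmed in TAVI reports
pipeline_tavi = {"p01", "p02", "p04", "p09"}    # flagged by the NLP pipeline

tavi_sensitivity = sensitivity(pipeline_tavi, reference_tavi)
print(f"TAVI sensitivity: {tavi_sensitivity:.2f}")
```

Sensitivity on its own ignores false positives (here, `p09`), so a fuller validation would usually report precision or specificity alongside it.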