Abstract

PURPOSE Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous, and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analyses for timely, reliable results. In particular, research on clinical notes relevant to patient care and outcome has seldom been conducted because of the complexity of data extraction and accurate annotation. RECIST is a set of widely accepted research criteria to evaluate tumor response in patients undergoing antineoplastic therapy. The aim of this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for development of natural language processing tools.

METHODS We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, time-interval filters, and note-type filters for identifying RECIST information and established a gold standard data set for patient outcome research.

RESULTS Keyword filtering reduced the clinical notes to 37,406; restricting to four note types within 12 months postdiagnosis further reduced the number to 5,005 notes, which were manually annotated and covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation) contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors.

CONCLUSION We have established a gold standard data set to accommodate development of biomedical informatics tools in accelerating research into antineoplastic therapeutic response.
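The filtering cascade described in the abstract (keywords, then note type, then time interval) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The keyword list, the four note-type names, the `Note` record, and the 365-day approximation of "12 months postdiagnosis" are all assumptions made for illustration, because the abstract does not specify them.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical placeholders: the abstract does not list the predefined
# keywords or name the four note types that were retained.
KEYWORDS = ["complete response", "partial response",
            "stable disease", "progressive disease"]
NOTE_TYPES = {"oncology consult", "progress note",
              "radiology report", "hospital summary"}
WINDOW = timedelta(days=365)  # assumed approximation of 12 months

# Case-insensitive match on any keyword, treated as a literal phrase.
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


@dataclass(frozen=True)
class Note:
    patient_id: str
    note_type: str
    note_date: date
    text: str


def filter_notes(notes, diagnosis_dates):
    """Keep notes that pass the keyword, note-type, and time-interval filters."""
    kept = []
    for note in notes:
        if not KEYWORD_RE.search(note.text):
            continue  # no predefined keyword in the note text
        if note.note_type not in NOTE_TYPES:
            continue  # not one of the retained note types
        dx = diagnosis_dates[note.patient_id]
        if not (dx <= note.note_date <= dx + WINDOW):
            continue  # outside the postdiagnosis window
        kept.append(note)
    return kept


def patient_coverage(kept, diagnosis_dates):
    """Fraction of patients with at least one retained note."""
    covered = {note.patient_id for note in kept}
    return len(covered) / len(diagnosis_dates)
```

Applied to a list of `Note` records and a mapping from patient ID to diagnosis date, `filter_notes` returns the candidate notes for manual annotation, and `patient_coverage` gives the share of patients still represented, the quantity reported in the abstract as 609 of 622.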

Highlights

  • An electronic health record (EHR) is a digital form of a patient’s medical history, making real-time, patient-centered information available instantly and securely to authorized users

  • The resulting data set of 609 cases (n = 503 for training and n = 106 for validation) contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease

  • Our current work focuses on predefined treatments as a single group, “pharmacotherapy”; the established gold standard data set, with built-in training and validation cases, is readily expandable to specific drugs and regimens, and the RECIST tools to be developed on the basis of the data set will be applicable to solid tumors


Summary

Introduction

An electronic health record (EHR) is a digital form of a patient’s medical history, making real-time, patient-centered information available instantly and securely to authorized users. Before the era of artificial intelligence (AI) techniques, research dealing with unstructured clinical notes relevant to patient care and outcome, such as response to therapy, could not be conducted effectively because of the complexity of data extraction and accurate annotation. These challenges are being surmounted with the application of AI techniques (eg, clinical natural language processing [NLP] tools and machine learning)[1,2]; as a consequence, EHRs are gradually being used to facilitate and accelerate research relevant to patient care.[3]

