Abstract

For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist the development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings: serial annotation, parallel annotation, and pre-annotation. The quality of the annotations was evaluated for each setting. Our results suggest that the pre-annotation approach is less reliable in terms of quality than serial annotation but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
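The abstract reports an inter-annotator agreement score but does not say how it was computed. As an illustration only, the following minimal Python sketch assumes agreement is measured as an entity-level F1 score over exact matches of span and category between two annotators. The `Entity` type, the `agreement_f1` function, and the sample offsets are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One annotated PHI span (hypothetical representation)."""
    start: int      # character offset where the span begins
    end: int        # character offset where the span ends
    category: str   # PHI category, e.g. "NAME" or "DATE"


def agreement_f1(annotator_a: set[Entity], annotator_b: set[Entity]) -> float:
    """Exact-match F1 between two annotators' entity sets for one report."""
    if not annotator_a and not annotator_b:
        # Both annotators found no PHI: treat as full agreement (an assumption).
        return 1.0
    matched = len(annotator_a & annotator_b)
    # F1 of precision (matched / |B|) and recall (matched / |A|).
    return 2 * matched / (len(annotator_a) + len(annotator_b))


# Made-up annotations for a single report.
a = {Entity(0, 10, "NAME"), Entity(25, 35, "DATE"), Entity(40, 52, "LOCATION")}
b = {Entity(0, 10, "NAME"), Entity(25, 35, "DATE"), Entity(60, 66, "ID")}
score = agreement_f1(a, b)
```

Exact matching is the strictest choice; a relaxed variant that counts overlapping spans as matches would generally give higher scores.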

Highlights

  • For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality

  • The final gold standard OpenDeID corpus consists of 2,100 unique pathology reports of 1,833 unique cancer patients from four urban Australian hospitals

  • Most of the annotated Protected Health Information (PHI) entities belong to the NAME category, followed by LOCATION, ID, and DATE


Summary

Introduction

For research purposes, protected health information is often redacted from unstructured electronic health records (EHRs) to preserve patient privacy and confidentiality. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. Researchers manually redacted the Protected Health Information (PHI) in EHRs. It was reported that the average time required to manually de-identify a single clinical note (7.9 ± 6.1 PHI per note) was 87.3 ± 61 seconds [4]. A de-identification corpus is a large set of unstructured texts in which PHI entities have been manually annotated. The 2014 i2b2/UTHealth de-identification corpus contained a total of 1,304 longitudinal clinical narratives of 296 patients from the USA. In this corpus, 28,872 PHI entities were annotated and classified into 6 PHI categories and 25 subcategories [17,18]. Another corpus is the 2016 CEGS N-GRID de-identification corpus of 1,000 psychiatric notes.

[Table: number of PHI entities per category (NAME, AGE, CONTACT, LOCATION, DATE, ID, PROFESSION, OTHER), with the total, per-report average, and standard deviation of PHI entities and of tokens; values not included in this summary.]
