Re-examination of Rule-Based Methods in Deidentification of Electronic Health Records: Algorithm Development and Validation.

Zhenyu Zhao,Buzhou Tang,Muyun Yang,Tiejun Zhao

doi:10.2196/17622

Abstract

Background: Deidentification of clinical records is a critical step before their publication. This is usually treated as a type of sequence labeling task, and ensemble learning is one of the best performing solutions. Under the framework of multi-learner ensemble, the significance of a candidate rule-based learner remains an open issue.

Objective: The aim of this study is to investigate whether a rule-based learner is useful in a hybrid deidentification system and to offer suggestions on how to build and integrate a rule-based learner.

Methods: We chose a data-driven rule learner named transformation-based error-driven learning (TBED) and integrated it into the best performing hybrid system for this task.

Results: On the popular Informatics for Integrating Biology and the Bedside (i2b2) deidentification data set, experiments showed that TBED can offer high performance with its generated rules, and that integrating the rule-based model into an ensemble framework reached an F1 score of 96.76%, the best performance reported in the community.

Conclusions: We showed that the rule-based method makes an effective contribution to the current ensemble learning approach for the deidentification of clinical records. Such a rule system can be learned automatically by TBED, avoiding the high cost and low reliability of manual rule composition. In particular, boosting the ensemble model with rules achieved the best reported performance on the deidentification of clinical records.

Highlights

Electronic health records (EHRs) are rich resources for clinical research in which a large amount of medical knowledge is contained.
On the popular Informatics for Integrating Biology and the Bedside (i2b2) deidentification data set, experiments showed that transformation-based error-driven learning (TBED) can offer high performance with its generated rules, and integrating the rule-based model into an ensemble framework, which reached an F1 score of 96.76%, achieved the best performance reported in the community.
We proved the rule-based method offers an effective contribution to the current ensemble learning approach for the deidentification of clinical records.

Summary

Introduction

Electronic health records (EHRs) are rich resources for clinical research, containing a large amount of medical knowledge. Deidentification is conventionally performed manually, with crowd-sourced workers tagging the protected health information (PHI) and removing it. This is prohibitively expensive in terms of manpower, given the large scale of existing clinical corpora. To encourage innovation in this field, three deidentification shared tasks were organized by Informatics for Integrating Biology and the Bedside (i2b2) in 2006, 2014, and 2016. Deidentification of clinical records is a critical step before their publication. This is usually treated as a type of sequence labeling task, and ensemble learning is one of the best performing solutions. Under the framework of multi-learner ensemble, the significance of a candidate rule-based learner remains an open issue.
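To make the idea of a data-driven rule learner concrete, the following is a minimal sketch of greedy transformation-based error-driven learning on a toy token-labeling problem. It is not the authors' TBED implementation. The two rule templates (current word, previous word), the `O`/`NAME` label set, the all-`O` baseline, and the example sentence are assumptions chosen for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    """Hypothetical rule: if `feature` equals `value`, relabel from_label -> to_label."""
    feature: str      # "word" or "prev_word" (illustrative templates only)
    value: str
    from_label: str
    to_label: str


def feature_value(tokens: List[str], i: int, feature: str) -> str:
    """Return the lowercased token feature that a rule condition is tested against."""
    if feature == "word":
        return tokens[i].lower()
    if feature == "prev_word":
        return tokens[i - 1].lower() if i > 0 else "<s>"
    raise ValueError(f"unknown feature: {feature}")


def apply_rule(rule: Rule, tokens: List[str], labels: List[str]) -> List[str]:
    """Apply one rule to every position; conditions depend on tokens only."""
    return [
        rule.to_label
        if lab == rule.from_label and feature_value(tokens, i, rule.feature) == rule.value
        else lab
        for i, lab in enumerate(labels)
    ]


def gain(rule: Rule, tokens: List[str], labels: List[str], gold: List[str]) -> int:
    """Errors corrected minus correct labels damaged when the rule is applied."""
    new = apply_rule(rule, tokens, labels)
    fixed = sum(1 for o, n, g in zip(labels, new, gold) if o != g and n == g)
    broken = sum(1 for o, n, g in zip(labels, new, gold) if o == g and n != g)
    return fixed - broken


def learn_rules(tokens: List[str], gold: List[str], baseline: str = "O",
                min_gain: int = 1, max_rules: int = 10) -> Tuple[List[Rule], List[str]]:
    """Greedily pick the rule with the highest net gain until no rule helps."""
    labels = [baseline] * len(tokens)   # initial-state annotation
    rules: List[Rule] = []
    while len(rules) < max_rules:
        # Instantiate both templates at every currently mislabeled position.
        candidates = {
            Rule(f, feature_value(tokens, i, f), labels[i], gold[i])
            for i in range(len(tokens)) if labels[i] != gold[i]
            for f in ("word", "prev_word")
        }
        if not candidates:
            break
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda r: gain(r, tokens, labels, gold))
        if gain(best, tokens, labels, gold) < min_gain:
            break
        rules.append(best)
        labels = apply_rule(best, tokens, labels)
    return rules, labels


if __name__ == "__main__":
    toy_tokens = "Dr. Smith saw Mr. Jones and Dr. Lee today".split()
    toy_gold = ["O", "NAME", "O", "O", "NAME", "O", "O", "NAME", "O"]
    learned, predicted = learn_rules(toy_tokens, toy_gold)
    for r in learned:
        print(r)
    print(predicted == toy_gold)
```

The learned rules form an ordered list that is replayed in the same order at prediction time. In a hybrid system of the kind described here, the initial labels would more plausibly come from another learner than from an all-`O` baseline, but that wiring is outside the scope of this sketch.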
