C3-4: An Algorithm to Combine Machine Learning and Structured Data to Automate De-identification of Clinical Text

D.-T Tran, D Carrell, S Halgrim

doi:10.3121/cmr.2014.1250.c3-4

Abstract

Background/Aims: Clinical text is an important resource for research. To maintain patient privacy when researching this text, we use de-identification. The hiding in plain sight (HIPS) method is promising; it replaces personally identifiable information (PII) with realistic surrogates so that any remaining real PII is hard to distinguish from the fake information. However, some challenges with HIPS remain, such as overlooked PII. We explored these challenges and hypothesized that we could find more PII by combining structured data with a machine learning algorithm.

Methods: The machine learning de-identification software we used is the MITRE Identification Scrubber Toolkit (MIST), developed by MITRE. Trained chart abstractors annotated Family Practice notes with the following PII types: address, age, date, provider name, email, IP address, consumer number, organization name, other id, phone, patient name, room id, social security number, and URL address. The structured data included in this experiment were the patient’s address, age, date of birth, email, phone, consumer number, and social security number, along with the visit provider name, visit date, and visit location. We queried these data from Clarity, a relational reporting database for Group Health’s electronic health record (EHR) system. In our first experiment, we used MIST to train a model on 100 documents and then tested it on 10 notes. We reviewed the PII that remained after de-identification and determined whether each item was available in the structured data.

Results: MIST’s precision was 0.93 and its recall was 0.77. MIST left 13 leaks and incorrectly identified two instances of blood pressure numbers as dates. Structured data obtained for the corresponding visit note could eliminate 3 of the 13 leaks. The remaining 10 leaks were 3 locations, 1 age that belonged to the patient’s child, 1 visit date from before the EHR system, 3 pieces of historical visit data, and 2 mentions of ‘sheriff’, which the chart abstractor had annotated as an “other” PII type.

Conclusions: Our results show that combining structured data with MIST can potentially improve de-identification of Group Health clinical text. Future work is to build an automated system that includes historical visit data; to use gender, race, and language to create more realistic surrogates; and to evaluate the loss of important clinical information in de-identified documents.
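The idea of checking structured data against text that the model left untouched can be illustrated with a minimal Python sketch. It is not MIST's API and not the authors' implementation: the function name `find_structured_leaks`, the span representation, the sample note, and the use of case-insensitive exact string matching are all assumptions made for illustration. The abstract does not say how structured values would be matched against the note.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def find_structured_leaks(
    text: str,
    model_spans: List[Span],
    structured: Dict[str, List[str]],
) -> List[Tuple[str, int, int]]:
    """Return (pii_type, start, end) for every structured value that occurs in
    the note outside the spans already tagged by the model.

    Hypothetical helper: matching is case-insensitive exact string search,
    which is an assumption and not something the abstract specifies.
    """
    leaks = []
    for pii_type, values in structured.items():
        for value in values:
            if not value:
                continue  # skip empty fields from the structured record
            for m in re.finditer(re.escape(value), text, flags=re.IGNORECASE):
                span = (m.start(), m.end())
                if not any(overlaps(span, s) for s in model_spans):
                    leaks.append((pii_type, span[0], span[1]))
    return sorted(leaks, key=lambda leak: leak[1])


# Invented example note and record (no real patient data).
note = "Seen by Dr. Alvarez on 03/14/2013. Call 555-0100 with questions."
date_start = note.find("03/14/2013")
model_spans = [(date_start, date_start + len("03/14/2013"))]  # model found only the date
structured = {
    "provider name": ["Alvarez"],
    "visit date": ["03/14/2013"],
    "phone": ["555-0100"],
}

leaks = find_structured_leaks(note, model_spans, structured)
for pii_type, start, end in leaks:
    print(pii_type, repr(note[start:end]))
```

In this sketch the visit date is skipped because the model already tagged it, while the provider name and phone number are reported as candidate leaks. Values that are absent from the structured record for the visit, such as a family member's age or historical visit data, would not be caught by this kind of lookup, which is consistent with the leaks reported as remaining above.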
