Large Language Models for Electronic Health Record De-Identification in English and German

Samuel Sousa, Michael Jantscher, Mark Kröll, Roman Kern

doi:10.3390/info16020112

Journal: Information
Publication Date: Feb 6, 2025
License type: CC BY 4.0

Abstract

Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating the patient’s privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing data generation in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially data transmission to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare de-identification performance for in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning with a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs.

Full Text