A privacy-preserving distributed filtering framework for NLP artifacts

Md Nazmus Sadat, Md Momin Al Aziz, Xiaoqian Jiang, Noman Mohammed, Serguei Pakhomov, Hongfang Liu

doi:10.1186/s12911-019-0867-z

Abstract

Background: Medical data sharing is a big challenge in biomedicine, which often hinders collaborative research. Due to privacy concerns, clinical notes cannot be directly shared. Much effort has been dedicated to de-identifying clinical notes, but it is still very challenging to accurately locate and scrub all sensitive elements from notes automatically. An alternative approach is to remove sentences that might contain sensitive terms related to personal information.

Methods: A previous study introduced a frequency-based filtering approach that removes sentences containing low-frequency bigrams to improve privacy protection without significantly decreasing utility. Our work extends this method to clinical notes from distributed sources, with security and privacy considerations. We developed a novel secure protocol based on private set intersection and secure thresholding to identify uncommon and low-frequency terms, which can be used to guide sentence filtering.

Results: As the computational cost of our proposed framework mostly depends on the cardinality of the intersection of the sets and the number of data owners, we evaluated the framework in terms of these two factors. Experimental results demonstrate that our proposed method is scalable in various experimental settings. In addition, we evaluated our framework in terms of data utility. This evaluation shows that the proposed method is able to retain enough information for data analysis.

Conclusion: This work demonstrates the feasibility of using homomorphic encryption to develop a secure and efficient multi-party protocol.
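To make the filtering idea in the Methods concrete, the following is a minimal plaintext sketch, not the secure protocol itself. It assumes that a bigram is kept only if it appears at every data owner and its summed count reaches a threshold; that rule, the tokenizer, the sentence splitter, and all function names are illustrative assumptions rather than details taken from the paper. In the actual framework the intersection and the thresholding would be computed under encryption, which this sketch omits entirely.

```python
import re
from collections import Counter


def split_sentences(note):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def bigrams(sentence):
    # Lowercased alphanumeric tokens paired with their successor.
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return list(zip(tokens, tokens[1:]))


def common_bigrams(owner_notes, threshold):
    # owner_notes: one list of notes per data owner (hypothetical layout).
    per_owner = []
    for notes in owner_notes:
        counts = Counter()
        for note in notes:
            for sentence in split_sentences(note):
                counts.update(bigrams(sentence))
        per_owner.append(counts)
    if not per_owner:
        return set()

    # Plaintext stand-in for private set intersection:
    # bigrams that occur at every owner.
    shared = set(per_owner[0])
    for counts in per_owner[1:]:
        shared &= set(counts)

    # Plaintext stand-in for secure thresholding:
    # keep shared bigrams whose summed count reaches the threshold.
    total = Counter()
    for counts in per_owner:
        total.update(counts)
    return {b for b in shared if total[b] >= threshold}


def filter_note(note, keep):
    # Drop any sentence containing a bigram outside the kept set.
    # Sentences with fewer than two tokens have no bigrams and are kept.
    return [s for s in split_sentences(note)
            if all(b in keep for b in bigrams(s))]


owner_a = ["Patient has chest pain. Patient has fever."]
owner_b = ["Patient has chest pain. Seen by John Smith."]
keep = common_bigrams([owner_a, owner_b], threshold=2)
filtered_b = filter_note(owner_b[0], keep)
```

In this toy input, the sentence naming a person contains bigrams that only one owner has seen, so it is filtered out, while the sentence shared by both owners survives.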

Highlights

Medical data sharing is a big challenge in biomedicine, which often hinders collaborative research
Manual approaches to identifying Protected Health Information (PHI) are error-prone (Neamatullah et al. [2] report that the recall of 14 clinicians in detecting PHI in 130 clinical notes varied from 0.63 to 0.94) and expensive (e.g., ~$50/h to read and label 20k words per hour in de-identifying the MIMIC II database [3])
An open-source implementation of our proposed framework is available on GitHub
Experimental results: from the description of our proposed method, the runtime depends mostly on the cardinality of the intersection of the sets and the number of data owners

Summary

Introduction

Medical data sharing is a big challenge in biomedicine, which often hinders collaborative research. Clinical notes are an indispensable component of electronic health records (EHRs) and contain important information (such as symptoms and medical history) that structured data might not cover. Sharing clinical notes can promote research, improve healthcare services, and contribute to clinical decision support [1]. De-identifying such data to mitigate privacy risks has been a very challenging task. De-identification commonly targets the Protected Health Information (PHI) defined in the HIPAA Safe Harbor method. This is done through the detection and scrubbing of 18 specific categories of PHI, including names, social security numbers, and dates. Berman [5] developed a concept-matching algorithm that steps through confidential pathology text to replace medical terms matching a standard nomenclature code with a synonymous term while keeping the high

Objectives

Results

Discussion

Conclusion