Privacy protection of textual medical documents

Montserrat Batet, David Sanchez

doi:10.1109/noms.2014.6838361

Abstract

With the adoption of information technologies, healthcare organisations compile large amounts of patient-related documents. Quite often, this data needs to be released to third parties for research or business purposes. The inherent sensitivity of patient information has led to legislation that protects the privacy of individuals. To comply with this legislation, patient-related documents must be redacted or sanitized before release. This is usually done manually, which is costly and time-consuming, or by means of ad-hoc solutions that either protect only structured types of sensitive information (e.g. social security numbers) or remove sensitive terms, which hampers the utility of the output. In this paper, we propose an automatic sanitization method for textual medical documents that protects sensitive terms and those semantically related to them, while retaining as much of the output's utility as possible. Unlike redaction schemes, which are based on term removal, our method improves the utility of the protected output by replacing sensitive terms with appropriate generalisations retrieved from medical and general-purpose knowledge bases. Experiments conducted on highly sensitive documents, in accordance with current regulations on healthcare data privacy, show promising results in terms of both the privacy and the utility of the output.
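The contrast between term removal and generalisation-based replacement can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the taxonomy `TOY_TAXONOMY`, the functions `generalise` and `sanitise`, and the `[REDACTED]` fallback are all hypothetical names introduced here, and a hand-written dictionary stands in for the medical and general-purpose knowledge bases. The detection of semantically related terms is not covered.

```python
import re

# Hypothetical toy taxonomy: each term maps to its direct generalisation.
# It stands in for a real knowledge base and is an assumption of this sketch.
TOY_TAXONOMY = {
    "hiv": "viral infection",
    "hepatitis c": "viral infection",
    "viral infection": "infection",
    "infection": "disease",
    "lung cancer": "cancer",
    "cancer": "disease",
}


def generalise(term, sensitive, taxonomy):
    """Climb the taxonomy until a term outside the sensitive set is found.

    Returns None when the taxonomy runs out before reaching such a term.
    """
    current = term.lower()
    visited = set()
    while current in sensitive:
        if current in visited:  # guard against cycles in the taxonomy
            return None
        visited.add(current)
        current = taxonomy.get(current)
        if current is None:
            return None
    return current


def sanitise(text, sensitive, taxonomy):
    """Replace each sensitive term in text with a non-sensitive generalisation."""
    sensitive = {t.lower() for t in sensitive}
    if not sensitive:
        return text
    # Longest terms first so multi-word terms win over shorter alternatives.
    terms = sorted(sensitive, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        general = generalise(match.group(0), sensitive, taxonomy)
        # Fall back to plain redaction when no generalisation is available.
        return general if general is not None else "[REDACTED]"

    return pattern.sub(replace, text)


report = "Patient diagnosed with HIV and lung cancer."
print(sanitise(report, {"HIV", "lung cancer"}, TOY_TAXONOMY))
# prints "Patient diagnosed with viral infection and cancer."
```

In this sketch a reader of the sanitized text still learns that the patient has an infection and a cancer, whereas plain removal of both terms would leave no clinical information at all. If a generalisation is itself marked as sensitive, the replacement climbs one more level in the taxonomy.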

Full Text