De-identification Methods Research Articles

Background: The increased use and adoption of Electronic Health Records (EHR) has caused tremendous growth in digital information useful to clinicians and researchers and for many operational purposes. However, this information is rich in Protected Health Information (PHI), which severely restricts its access and possible uses. A number of investigators have developed methods for automatically de-identifying EHR documents by removing PHI, as specified in the Health Insurance Portability and Accountability Act “Safe Harbor” method. This study focuses on the evaluation of existing automated text de-identification methods and tools, as applied to Veterans Health Administration (VHA) clinical documents, to assess which methods perform better with each category of PHI found in our clinical notes, and when new methods are needed to improve performance.

Methods: We installed and evaluated five text de-identification systems “out-of-the-box” using a corpus of VHA clinical documents. The systems based on machine learning methods were trained with the 2006 i2b2 de-identification corpora and evaluated with our VHA corpus, and also evaluated with a ten-fold cross-validation experiment using our VHA corpus. We counted exact, partial, and fully contained matches with reference annotations, considering each PHI type separately, or only one unique ‘PHI’ category. Performance of the systems was assessed using recall (equivalent to sensitivity) and precision (equivalent to positive predictive value) metrics, as well as the F2-measure.

Results: Overall, systems based on rules and pattern matching achieved better recall, and precision was always better with systems based on machine learning approaches. The highest “out-of-the-box” F2-measure was 67% for partial matches; the best precision and recall were 95% and 78%, respectively. Finally, the ten-fold cross-validation experiment allowed for an increase of the F2-measure to 79% with partial matches.

Conclusions: The “out-of-the-box” evaluation of text de-identification systems provided us with compelling insight about the best methods for de-identification of VHA clinical documents. The error analysis demonstrated an important need for customization to PHI formats specific to VHA documents. This study informed the planning and development of a “best-of-breed” automatic de-identification application for VHA clinical text.
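
The evaluation measures named in this abstract can be illustrated with a short sketch. It is not the study's evaluation code: the span representation, the function names, and in particular the rule used here for "fully contained" (the reference span lies entirely inside the system span) are assumptions made for illustration. The F2-measure is the standard F-beta score with beta = 2, which weights recall more heavily than precision.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed convention)


def classify_match(system: Span, reference: Span) -> str:
    """Classify a system annotation against one reference annotation.

    Assumed definitions (not taken from the study):
    exact = identical offsets; fully_contained = reference lies inside the
    system span; partial = any other overlap.
    """
    if system == reference:
        return "exact"
    if system[0] <= reference[0] and reference[1] <= system[1]:
        return "fully_contained"
    if system[0] < reference[1] and reference[0] < system[1]:
        return "partial"
    return "none"


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision (positive predictive value) and recall (sensitivity) from counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; the default beta=2 gives the F2-measure."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With hypothetical values of precision 0.5 and recall 1.0, f_beta returns about 0.83 for beta = 2 but about 0.67 for beta = 1. This shows why a recall-weighted measure suits de-identification, where a missed PHI item costs more than a spurious removal.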

In the normal course of activity, pathologists create and archive immense data sets of scientifically valuable information. Researchers need pathology-based data sets, annotated with clinical information and linked to archived tissues, to discover and validate new diagnostic tests and therapies. Pathology records can be used for research purposes (without obtaining informed patient consent for each use of each record), provided the data are rendered harmless. Large data sets can be made harmless through 3 computational steps: (1) deidentification, the removal or modification of data fields that can be used to identify a patient (name, social security number, etc.); (2) rendering the data ambiguous, ensuring that every data record in a public data set has a nonunique set of characterizing data; and (3) data scrubbing, the removal or transformation of words in free text that can be used to identify persons or that contain information that is incriminating or otherwise private. This article addresses the problem of data scrubbing. The objective was to design and implement a general algorithm that scrubs pathology free text, removing all identifying or private information. The Concept-Match algorithm steps through confidential text. When a medical term matching a standard nomenclature term is encountered, the term is replaced by a nomenclature code and a synonym for the original term. When a high-frequency "stop" word, such as "a", "an", "the", or "for", is encountered, it is left in place. When any other word is encountered, it is blocked and replaced by asterisks. This produces a scrubbed text. An open-source implementation of the algorithm is freely available. The Concept-Match scrub method transformed pathology free text into scrubbed output that preserved the sense of the original sentences, while it blocked terms that did not match terms found in the Unified Medical Language System (UMLS). The scrubbed product is safe, in the restricted sense that the output retains only standard medical terms. The software implementation scrubbed more than half a million surgical pathology report phrases in less than an hour. Computerized scrubbing can render the textual portion of a pathology report harmless for research purposes. Scrubbing and deidentification methods allow pathologists to create and use large pathology databases to conduct medical research.
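
The three rules of the Concept-Match algorithm described in this abstract can be sketched as follows. This is a minimal illustration, not the published open-source implementation: the tiny nomenclature table, its codes (CODE001 and so on), the stop-word list, the tokenizer, and the longest-phrase-first matching strategy are all hypothetical stand-ins for a real nomenclature such as UMLS.

```python
import re

# Hypothetical toy nomenclature: term -> (code, synonym). Not real UMLS content.
NOMENCLATURE = {
    "renal cell carcinoma": ("CODE001", "kidney carcinoma"),
    "kidney": ("CODE002", "renal"),
    "biopsy": ("CODE003", "bx"),
}

# Hypothetical short list of high-frequency stop words that are left in place.
STOP_WORDS = {"a", "an", "the", "for", "of", "in", "and", "with", "on"}

# Longest nomenclature phrase, in words, to try at each position.
MAX_TERM_WORDS = max(len(term.split()) for term in NOMENCLATURE)


def scrub(text: str) -> str:
    """Return scrubbed text: coded medical terms, stop words, and asterisks."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    out = []
    i = 0
    while i < len(words):
        # Rule 1: try the longest nomenclature phrase starting at position i.
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in NOMENCLATURE:
                code, synonym = NOMENCLATURE[phrase]
                out.append(f"{synonym} {code}")
                i += n
                break
        else:
            # Rule 2: keep stop words. Rule 3: block every other word.
            word = words[i]
            out.append(word if word in STOP_WORDS else "*")
            i += 1
    return " ".join(out)


print(scrub("Mr. Smith had a biopsy of the kidney on 3/4/2001"))
# prints "* * * a bx CODE003 of the renal CODE002 on * * *"
```

Because every word that is neither a nomenclature term nor a stop word is blocked, names and dates disappear without the algorithm needing to recognize them as identifiers. That is the restricted sense of "safe" used in the abstract.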

Articles published on De-identification Methods

Combining knowledge- and data-driven methods for de-identification of clinical narratives

Preparing a collection of radiology examinations for distribution and retrieval.

A De-identification method for bilingual clinical texts of various note types.

Multidimensional Suppression for K-Anonymity in Public Dataset Using See5

Beyond the DICOM Header: Additional Issues in Deidentification

PS3-13: Re-Identification Risk Associated with Sharing Linked Genomic and Phenotypic Data from the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH)

Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text

Lessons Learned from Development of De-identification System for Biomedical Research in a Korean Tertiary Hospital

Correction: Lessons Learned from Development of De-identification System for Biomedical Research in a Korean Tertiary Hospital

BoB, a best-of-breed automated text de-identification system for VHA clinical documents

Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents

De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

A systematic review of re-identification attacks on health data.

Secure management of biomedical data with cryptographic hardware.

Avoiding Disclosure of Individually Identifiable Health Information

Evaluating the State-of-the-Art in Automatic De-identification

Concept-match medical data scrubbing. How pathology text can be used in research.
