NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research.

Guo-Qiang Zhang, Guangming Xing, Samden D Lhatoo, Bilal Zonjy, Licong Cui, Shiqiang Tao, Jeno Mozes

doi:10.2196/medinform.4959

Abstract

Background: A unique study identifier serves as a key for linking research data about a study subject without revealing protected health information in the identifier. While sufficient for single-site and limited-scale studies, the use of common unique study identifiers has several drawbacks for large multicenter studies, where thousands of research participants may be recruited from multiple sites. An important property of study identifiers is error tolerance (validatability): inadvertent editing mistakes during their transmission and use will most likely result in invalid study identifiers.

Objective: This paper introduces a novel method called "Randomized N-gram Hashing (NHash)" for generating unique study identifiers in a distributed and validatable fashion in multicenter research. NHash has a unique set of properties: (1) it is a pseudonym serving the purpose of linking research data about a study participant for research purposes; (2) it can be generated automatically in a completely distributed fashion with virtually no risk of identifier collision; (3) it incorporates a set of cryptographic hash functions based on N-grams, combined with additional encryption techniques such as a shift cipher; (4) it is validatable (error tolerant) in the sense that inadvertent edit errors will mostly result in invalid identifiers.

Methods: NHash consists of 2 phases. In the first phase, an intermediate string is generated using randomized N-gram hashing. This string consists of a collection of N-gram hashes f_1, f_2, ..., f_k. The input for each function f_i has 3 components: a random number r, an integer n, and input data m. The result, f_i(r, n, m), is an n-gram of m with a starting position s, which is computed as (r mod |m|), where |m| represents the length of m. The output of Phase 1 is the concatenation of the sequence f_1(r_1, n_1, m_1), f_2(r_2, n_2, m_2), ..., f_k(r_k, n_k, m_k). In the second phase, the intermediate string generated in Phase 1 is encrypted using techniques such as a shift cipher. The result of the encryption, concatenated with the random number r, is the final NHash study identifier.

Results: We performed experiments using a large synthesized dataset comparing NHash with random strings, and demonstrated negligible probability of collision. We implemented NHash for the Center for SUDEP Research (CSR), a National Institute of Neurological Disorders and Stroke-funded Center Without Walls for Collaborative Research in the Epilepsies. This multicenter collaboration involves 14 institutions across the United States and Europe, bringing together extensive and diverse expertise to understand sudden unexpected death in epilepsy patients (SUDEP).

Conclusions: The CSR Data Repository has successfully used NHash to link deidentified multimodal clinical data collected in participating CSR institutions, meeting all desired objectives of NHash.
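
The two phases described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: each input m_i is taken to be the SHA-256 hex digest of a field string, a single random number r is shared by all f_i, n-grams wrap around the end of m so that they always have length n, the shift cipher rotates characters within the hexadecimal alphabet, and r is appended as a fixed-width decimal suffix. The function names (ngram, shift_cipher, nhash, is_valid) and the validation strategy (recomputing the identifier from the same input data) are likewise hypothetical.

```python
import hashlib
import secrets

# Assumption: inputs are hex digests, so the cipher alphabet is the hex digits.
ALPHABET = "0123456789abcdef"


def ngram(r: int, n: int, m: str) -> str:
    """f_i(r, n, m): the n-gram of m starting at position s = r mod |m|."""
    s = r % len(m)
    # Assumption: wrap around the end of m (requires n <= |m|).
    return (m + m)[s:s + n]


def shift_cipher(text: str, shift: int) -> str:
    """Rotate every character by `shift` positions within ALPHABET."""
    size = len(ALPHABET)
    return "".join(ALPHABET[(ALPHABET.index(c) + shift) % size] for c in text)


def nhash(fields, sizes, r=None, shift=3, r_digits=6):
    """Phase 1: concatenate n-grams; Phase 2: encrypt and append r."""
    if r is None:
        r = secrets.randbelow(10 ** r_digits)
    # Assumption: each m_i is the SHA-256 hex digest of one input field.
    digests = [hashlib.sha256(f.encode("utf-8")).hexdigest() for f in fields]
    intermediate = "".join(ngram(r, n, m) for n, m in zip(sizes, digests))
    return shift_cipher(intermediate, shift) + f"{r:0{r_digits}d}"


def is_valid(identifier, fields, sizes, shift=3, r_digits=6):
    """Hypothetical check: recompute the identifier from its embedded r."""
    suffix = identifier[-r_digits:]
    if len(identifier) != sum(sizes) + r_digits:
        return False
    if not (suffix.isascii() and suffix.isdigit()):
        return False
    expected = nhash(fields, sizes, r=int(suffix), shift=shift, r_digits=r_digits)
    return expected == identifier


if __name__ == "__main__":
    # Fictional participant data, used only to exercise the sketch.
    fields = ["Aaron Skotnica", "08/13/1956", "07172485"]
    identifier = nhash(fields, sizes=[4, 4, 4])
    print(identifier, is_valid(identifier, fields, sizes=[4, 4, 4]))
```

Because the random number r travels with the identifier, anyone holding the same input data can recompute the encrypted n-grams and compare them, which is one way an edited character in the n-gram portion could be detected. The actual validation procedure used by NHash may differ from this sketch.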

Highlights

Unique study identifiers, or pseudonyms, are alphanumeric codes used in clinical and other research studies to uniquely identify a study participant without revealing in the identifiers any Protected Health Information (PHI) [1], such as name, full date of birth (DOB), and medical record number (MRN) [2].
The Center for SUDEP (sudden unexpected death in epilepsy patients) Research (CSR) Data Repository has successfully used Randomized N-gram Hashing (NHash) to link deidentified multimodal clinical data collected in participating CSR institutions, meeting all desired objectives of NHash. (JMIR Med Inform 2015;3(4):e35) doi:10.2196/medinform
A study identifier generated using NHash has a unique set of properties: (1) as a unique study identifier, it is a pseudonym serving the purpose of linking research data about a study subject for research purposes; (2) it can be generated automatically in a completely distributed and decentralized fashion, while still allowing data integration with virtually no risk of identifier collision; (3) it incorporates a set of cryptographic hash functions based on N-grams for its generation, and can be further encrypted if desired using techniques such as a shift cipher; and (4) it is validatable in the sense that inadvertent edit errors on NHash identifiers during their use will almost always result in invalid identifiers.

Summary

Introduction

Unique study identifiers, or pseudonyms, are alphanumeric codes used in clinical and other research studies to uniquely identify a study participant without revealing in the identifiers any Protected Health Information (PHI) [1], such as name, full date of birth (DOB), and medical record number (MRN) [2]. For a fictional study participant, Aaron Skotnica, with DOB 08/13/1956 and MRN 07172485, the unique study identifier could be a number such as 57, representing the 57th enrolled study subject. It could also be a randomly generated number, such as 28262. A unique study identifier serves as a key for linking research data about a study subject without revealing protected health information in the identifier. De-identification is a process in which PHI elements are eliminated or manipulated in order to hinder the disclosure of PHI contained in the original dataset. This involves removing all identifying data to create unlinkable data. One method of de-identification under HIPAA (the Safe Harbor method), used for the current study, requires that the data be stripped of 18 common identifiers, including patient names, geographic data, all elements of dates, telephone numbers, fax numbers, email addresses, social security numbers, and medical record numbers.

Methods

Results

Discussion

Conclusion