Abstract

Since its post-World War II inception, the science of record linkage has grown exponentially and is used across industrial, governmental, and academic agencies. The academic fields that rely on record linkage are diverse, ranging from history to public health to demography. In this paper, we introduce the different types of data linkage and give a historical context to their development. We then introduce the three types of underlying models for probabilistic record linkage: Fellegi-Sunter-based methods, machine learning methods, and Bayesian methods. Practical considerations, such as data standardization and privacy concerns, are then discussed. Finally, recommendations are given for organizations developing or maintaining record linkage programs, with an emphasis on organizations measuring long-term complications of disasters, such as 9/11.

Highlights

  • From its humble beginnings in post-World War II public health research, the field of “record linkage” (that is, the matching of records for unique entities across one or more lists) has exploded into a multi-field research focus

  • Record linkage as a field originated at the end of World War II; the original papers on record linkage concerned family structure in the United States and elsewhere [1,2,3] and a population registry in Canada [4]

  • Current research topics related to these concerns revolve around privacy-preserving record linkage and understanding the bias introduced by the requirement for informed consent [26,27]

Summary

Introduction

From its humble beginnings in post-World War II public health research, the field of “record linkage,” that is, the matching of records for unique entities (typically people, but sometimes organizations, addresses, or something else) across one or more lists, has exploded into a multi-field research focus (see Figure 1). Several joint studies are being formulated to study pooled patient populations across cohorts. The reasons for this are both scientific and practical. As more data become available electronically and computational power improves, access to health data, at least from a technical point of view, has become easier. This is fortuitous, as maintaining a large-scale research project over multiple decades among a trauma-exposed and aging population presents several challenges, chief among them attrition and reporting bias due to failing memories among respondents.

Methods
Data Combining Methods
Ranking
Historical Context
Fellegi-Sunter Model
Machine Learning
Bayesian Record Linkage Techniques
Open Research Questions
Practical Considerations
Data Cleaning and Standardization
Missing Data
Error Measurement
Software
Data Sharing
Documentation of Record Linkage Processes
Privacy Preserving Record Linkage
Biases in the Record Linkage Process
Findings
Conclusions
