Demystifying probabilistic linkage: Common myths and misconceptions.

James C Doidge, Katie Harron

doi:10.23889/ijpds.v3i1.410

Abstract

Many of the distinctions made between probabilistic and deterministic linkage are misleading. While these two approaches to record linkage operate in different ways and can produce different outputs, the distinctions between them result more from how they are implemented than from any intrinsic differences. In the way they are generally applied, probabilistic and deterministic procedures can be little more than alternative means to similar ends, or they can arrive at very different ends, depending on choices made during implementation. Misconceptions about probabilistic linkage contribute to reluctance to implement it and mistrust of its outputs. We aim to explain how the outputs of either approach can be tailored to suit the intended application, and also to highlight the ways in which probabilistic linkage is generally more flexible, more powerful and more informed by the data. We do this by examining common misconceptions about probabilistic linkage and how it differs from deterministic linkage, and by highlighting the potential impact of design choices on the outputs of either approach. We hope that a better understanding of linkage designs will help to allay concerns about probabilistic linkage, and help data linkers to select and tailor procedures to produce outputs that are appropriate for their intended use.

Highlights

Many of the distinctions made between probabilistic and deterministic linkage are misleading
Every possible match weight threshold that could be specified in probabilistic linkage corresponds to a set of decision rules that could have been specified in deterministic linkage
Many of the claims made about probabilistic linkage are based on a misinterpretation of match weights as being true likelihoods
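
The second highlight can be made concrete with a small sketch. It assumes the conventional log2 agreement and disagreement weights built from m- and u-probabilities, which this abstract does not spell out. The field names, the probability values, and the helper names `match_weight` and `rules_for_threshold` are all hypothetical and chosen only for illustration. The sketch enumerates every agreement pattern, scores it, and lists the patterns that a given threshold would accept; that list is a rule set that could equally have been written down as a deterministic procedure.

```python
from itertools import product
from math import log2

# Hypothetical m- and u-probabilities (illustrative values, not from the article).
# m = P(field agrees | records belong to the same entity)
# u = P(field agrees | records belong to different entities)
FIELDS = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.97, 0.005),
    "postcode": (0.80, 0.02),
}


def match_weight(pattern):
    """Sum the log2 agreement/disagreement weights for one agreement pattern."""
    total = 0.0
    for agrees, (m, u) in zip(pattern, FIELDS.values()):
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total


def rules_for_threshold(threshold):
    """Return the agreement patterns whose weight meets the threshold.

    Each returned dict maps field name -> True (agrees) / False (disagrees),
    i.e. one deterministic decision rule.
    """
    accepted = []
    for pattern in product([True, False], repeat=len(FIELDS)):
        if match_weight(pattern) >= threshold:
            accepted.append(dict(zip(FIELDS, pattern)))
    return accepted


if __name__ == "__main__":
    for rule in rules_for_threshold(8):
        print(rule)
```

With these illustrative values, a threshold of 8 accepts exactly the patterns in which date of birth agrees and at least one of surname or postcode also agrees. Moving the threshold changes which patterns are accepted, but the result is always a finite list of agreement patterns.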

Summary

Introduction

Many of the distinctions made between probabilistic and deterministic linkage are misleading. In this article we aim to explain how the outputs of either approach can be tailored to suit the intended application, and to highlight the ways in which probabilistic linkage is generally more flexible, more powerful and more informed by the data. We also aim to improve readers’ understanding of how record linkage procedures operate and how they can be tuned and adapted to produce quite similar, or very different, outputs, depending on the objectives of the data linker. To this end, we present a critical discussion of some common myths and misconceptions about probabilistic linkage and the differences between probabilistic and deterministic linkage.

Open Access under CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/deed.en)
