Linkage of Early 1900s Irish Census Records: Exploring the Impact of Household Structure and Crowdsourced Labels

Kayla Frisoli

doi:10.1184/r1/14455917.v1

Abstract

Record linkage is the process of identifying records corresponding to unique entities across data sets. Linking individuals in historical data allows researchers to better characterize topics like population mobility, the impact of local/national events, and generational changes. Historians in Ireland are currently interested in linking the recently released 1901 and 1911 census record databases. As with many (historical) record linkage applications, there are challenges arising from the digitization of handwritten records, high frequencies of common names, and human mobility. Traditional methods struggle with these issues, and it is often acknowledged that specific sub-populations (e.g., women who change their names, individuals who move between census dates) are linked with lower accuracy. Additionally, these methods often consider only pairwise record comparisons without incorporating household or relationship information across records. Furthermore, development and assessment of supervised record linkage methodology often rely on labeled data sets with unknown label quality. To help address these challenges, we designed a record linkage interface to study the impact of the human labeling process on the full record linkage pipeline. Via this interface, workers link records not only at the individual level but also at the household and within-household levels, matching 1901 Ireland census records to their (potential) 1911 counterparts. In addition, we collect multiple instances for each label to assess label uncertainty. Our work capitalizes on this label collection process as well as known historical changes and the data's household structure. We find evidence that models incorporating this information better link hard-to-match populations. Beyond linking the actual records and households, we collect information about how the labeler interacts with the interface (e.g., time spent, click patterns), providing rich information across labeler populations.
Our approach was iteratively adapted to balance worker engagement, label quality, and monetary expenses. We find differences in downstream record linkage model performance based on changes in label generation and argue that it is critical to pay attention to these changes when labeling records or building models with pre-existing data. Data about the crowdsourced individual and household matches, the human labelers (from both CMU and Amazon MTurk), and the overall labeling process will be made publicly available. We hope this data and our resulting insights prompt new areas of research within and beyond the record linkage community.

Full Text