Machine Learning for Data Linkage

Rhosanna Ellum, Kristina Xhaferaj, Alex Lewis, Viktor Račinskij, Rachel Shipsey, Zoe White

doi:10.23889/ijpds.v8i2.2240

Abstract

Data linkage traditionally uses deterministic and probabilistic methods. Alternatively, machine learning methods can be applied as classification algorithms, using the data to inform linkage decisions. This project compared the quality, measured by precision and recall, of traditional methods with that of selected machine learning methods applied to a standard linkage problem.

Two supervised methods, gradient boosted trees (GBT) and a multilayer perceptron classifier (MLPC), and one unsupervised method, maximum entropy classification (MEC), were implemented. The England and Wales 2021 Census to Census Coverage Survey (CCS) linkage was used as a gold-standard (GS) linked dataset, providing training samples for the supervised methods and testing samples for all methods. The F1 score (the harmonic mean of precision and recall) was used to compare the performance of the models and to determine the optimal parameters and thresholds. The Splink implementation of Fellegi-Sunter with Expectation Maximisation was used as a baseline for comparison.

The methods, trained on a sample of the GS, were used to link Census and CCS data. All methods performed well: MEC achieved the highest precision (99.79%) but the lowest recall (96.36%), and the MLPC model achieved the highest F1 score (98.94%). To understand the implications of not retraining supervised models for each dataset, the models were also used to link Census to a health dataset. The supervised models were not retrained using the health data; instead, the optimised GS models were applied. On this linkage, MEC had the lowest precision (96.51%) but the highest recall (98.48%) and the highest F1 score (97.49%). With F1 scores of 96.99% and 96.14% respectively, the GBT and MLPC supervised models were not far behind, despite not being trained on health data.

We have shown that machine learning methods can be used effectively for data linkage problems. Unsurprisingly, supervised models perform best when trained on and applied to the same data. Further research into generic training may allow us to use both supervised and unsupervised machine learning models for future data linkage.
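As an illustration of the F1 score described in the abstract, the following minimal Python sketch computes the harmonic mean of precision and recall and uses it to pick a classification threshold. It is not the authors' code: the function names, the toy scores and labels, and the candidate thresholds are hypothetical, and the simple threshold sweep is only one assumed way that "optimal thresholds" could be chosen. The only figures taken from the abstract are the MEC precision and recall on the health linkage.

```python
from typing import Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    thresholds: Sequence[float],
) -> Tuple[float, float]:
    """Return the (threshold, F1) pair with the highest F1.

    A candidate record pair is classed as a link when its score is at or
    above the threshold. `labels` holds the gold-standard truth (True = match).
    This sweep is an illustrative assumption, not the paper's procedure.
    """
    best = (thresholds[0], -1.0)
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = f1_score(precision, recall)
        if f1 > best[1]:
            best = (t, f1)
    return best


# Figures reported in the abstract for MEC on the Census-to-health linkage.
mec_health_f1 = f1_score(0.9651, 0.9848)
print(f"MEC health F1: {mec_health_f1:.2%}")  # prints "MEC health F1: 97.49%"

# Hypothetical toy data: model scores and gold-standard match status.
toy_scores = [0.95, 0.80, 0.60, 0.40, 0.10]
toy_labels = [True, True, False, True, False]
threshold, f1 = best_threshold(toy_scores, toy_labels, [0.3, 0.5, 0.7, 0.9])
print(f"Best threshold: {threshold}, F1: {f1:.3f}")
```

Recomputing F1 from the reported MEC precision (96.51%) and recall (98.48%) reproduces the reported 97.49%, which is a quick consistency check on the published figures.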

Journal: International Journal of Population Data Science
Publication Date: Sep 14, 2023
License type: CC BY 4.0
