Reusable, set-based selection algorithm for matched control groups

Daniel Thayer, Liv Kosnes, David V Ford, Martin L Heaven, Damon Berridge, Ann John, Keith Lloyd, John W Gregory

doi:10.23889/ijpds.v1i1.395

Abstract

Aims

The wealth of data available in linked administrative datasets offers great potential for research, but researchers face methodological and computational challenges in data preparation because of the size and complexity of the data. The creation of matched control groups in the Secure Anonymised Information Linkage (SAIL) Databank illustrates this point: SAIL contains multiple health datasets describing millions of individuals in Wales. The volume of data creates the potential for more precise matching, but only if an appropriate algorithm can be applied. We aimed to create such an algorithm for reuse by many research projects.

Methods

We developed set-based code in SQL that efficiently selects matches from millions of potential combinations in a relational database environment. It is parameterized to allow different matching criteria to be employed as needed, including follow-up time around an index event. A combinatorial optimisation problem occurs when a potential control could match more than one subject. We solved this by ranking potential match pairs first by the subject with the fewest potential matches, then by closeness of the match.

Results

One example of the algorithm's use was the Suicide Information Database Cymru, an electronic case-control study on suicide in Wales between 2003 and 2011. Subjects who had a cause of death recorded as self-harm were each matched to twenty controls who were alive at the subject's date of death and had the same gender and a similar birth week. The rate of matching success was >99.9%, with all subjects but one matching the full twenty controls. More than 99.99% of the matched controls had the same week of birth as the subject.

The second example was a matched cohort study looking at hospital admissions and type 1 diabetes, using the Brecon register of childhood diabetes in Wales. Matching was based on week of birth within two weeks, gender, county of residence, deprivation quintile, and residence in Wales at time of diagnosis. This study had a matching rate of 98.9%; 97.5% of subjects matched to five controls, and 69.8% of matches had the same week of birth.

Conclusions

This algorithm provides good matching performance while executing efficiently and scalably on large datasets. Its implementation as reusable code will facilitate more efficient, high-quality research in SAIL. Instead of spending many hours developing a custom solution, analysts can execute parameterized code in a few minutes. We hope it will also be useful beyond SAIL.
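The ranking rule described in the Methods can be illustrated with a small sketch. This is not the authors' SAIL code: the table layout (`subjects` and `controls` with `id`, `sex`, `birth_week`), the function name `match_controls`, and its parameters are all assumptions made for illustration. It uses Python's built-in `sqlite3` so that candidate pairs are generated and ranked in a single set-based SQL query, and it then resolves conflicts with a simple greedy pass that uses each control at most once.

```python
import sqlite3


def match_controls(subjects, controls, n_controls=1, week_tolerance=2):
    """Illustrative matching sketch (hypothetical schema, not the SAIL code).

    subjects, controls: lists of (id, sex, birth_week) tuples.
    Returns {subject_id: [control_id, ...]}.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE subjects (id INTEGER PRIMARY KEY, sex TEXT, birth_week INTEGER)")
    con.execute("CREATE TABLE controls (id INTEGER PRIMARY KEY, sex TEXT, birth_week INTEGER)")
    con.executemany("INSERT INTO subjects VALUES (?, ?, ?)", subjects)
    con.executemany("INSERT INTO controls VALUES (?, ?, ?)", controls)

    # Set-based step: one join builds every eligible (subject, control) pair.
    # Pairs are ordered by the size of the subject's candidate pool
    # (fewest first), then by closeness of birth week.
    ranked_pairs = con.execute(
        """
        WITH pairs AS (
            SELECT s.id AS subject_id,
                   c.id AS control_id,
                   ABS(s.birth_week - c.birth_week) AS week_gap
            FROM subjects AS s
            JOIN controls AS c
              ON c.sex = s.sex
             AND ABS(s.birth_week - c.birth_week) <= ?
        ),
        pool AS (
            SELECT subject_id, COUNT(*) AS n_candidates
            FROM pairs
            GROUP BY subject_id
        )
        SELECT p.subject_id, p.control_id
        FROM pairs AS p
        JOIN pool AS q ON q.subject_id = p.subject_id
        ORDER BY q.n_candidates, p.week_gap, p.subject_id, p.control_id
        """,
        (week_tolerance,),
    ).fetchall()
    con.close()

    # Conflict resolution (assumed greedy pass): walk the ranked pairs and
    # accept a pair only if the control is unused and the subject still
    # needs controls.
    matches = {subject_id: [] for subject_id, _, _ in subjects}
    used_controls = set()
    for subject_id, control_id in ranked_pairs:
        if control_id not in used_controls and len(matches[subject_id]) < n_controls:
            matches[subject_id].append(control_id)
            used_controls.add(control_id)
    return matches


# Subject 1 can only match control 10; subject 2 can match 10 or 11.
example = match_controls(
    subjects=[(1, "F", 100), (2, "F", 103)],
    controls=[(10, "F", 102), (11, "F", 105)],
)
print(example)  # {1: [10], 2: [11]}
```

In the example, a closest-pair-first ordering would give control 10 to subject 2 (gap of one week) and leave subject 1 unmatched. Ranking the subject with the smaller candidate pool first lets both subjects receive a control. Further criteria mentioned in the abstract, such as follow-up time around an index event, would be extra conditions in the join.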

Journal: International Journal of Population Data Science
Publication Date: Apr 19, 2017
License type: CC BY-NC-ND 4.0
