Abstract

Linked data is a powerful resource for data analytics and population-level research. However, linkage methods vary, and the choice of approach can affect downstream use of the data by introducing assumptions and biases into the resulting links. Stringent linkage methods strengthen confidence in the links identified, at the risk of missing true links; lenient rules or ill-considered comparisons, by contrast, may introduce false positive links. Choosing an approach is therefore non-trivial: it requires careful selection of preprocessing steps, model development and quality review to ensure suitable outputs, all of which can demand significant human expertise and insight. Real-world population-scale linkage can benefit from the automation and scalability offered by modern data centres, with many tasks eligible for pipelining, such as applying predefined cleaning routines, training defined models, and generating mapping tables. Despite this, pinch points requiring human interaction remain, such as selecting appropriate linkage fields, blocking rules and comparison methods, and reviewing the quality of predictions. We present an approach that provides scalable automation in linkage pipelines whilst retaining transparency of the linkage process for downstream users, giving them a dataset’s life history. The output for a given dataset is a versioned catalogue documenting the dataset’s journey, with transparent reporting of data origin, linkage settings, routines, and privacy-preserving quality analysis for inspection. This gives researchers insight into how the linkage process may affect their data and provides confidence in data usage. The insights also work in both directions, allowing users to provide feedback and iteratively refine linkage approaches.
