A Systematic Approach to Reconciling Data Quality Failures: Investigation Using Spinal Cord Injury Data

Wray Buntine, Andrew Nunn, Nandini Anantharama

doi:10.1055/s-0041-1735975

Abstract

Background: Secondary use of electronic health record (EHR) data requires evaluation of data quality (DQ) for fitness of use. While multiple frameworks exist for quantifying DQ, there are no guidelines for the evaluation of DQ failures identified through such frameworks.

Objectives: This study proposes a systematic approach to evaluating DQ failures through the understanding of data provenance, to support exploratory modeling in machine learning.

Methods: Our study is based on the EHR of spinal cord injury inpatients in a state spinal care center in Australia, admitted between 2011 and 2018 (inclusive), and aged over 17 years. As a prerequisite step, DQ was measured by applying a DQ framework to the EHR data through rules that quantified DQ dimensions. DQ was measured either as the percentage of values per field that met the criteria or as Krippendorff's α for agreement between variables. The DQ failures identified in this step were then assessed using semistructured interviews with purposively sampled domain experts.

Results: The measured DQ of the fields in our dataset ranged from 0 to 100% adherence. Understanding the data provenance of fields with DQ failures enabled us to ascertain whether each DQ failure was fatal, recoverable, or not relevant to the field's inclusion in our study. We also identified the themes of data provenance from a DQ perspective as systems, processes, and actors.

Conclusion: A systematic approach to understanding data provenance through the context of data generation helps in the reconciliation or repair of DQ failures and is a necessary step in the preparation of data for secondary use.

Highlights

Secondary use of electronic health record (EHR) data requires evaluation of data quality (DQ) for fitness of use
Fields are identified as DQ failures if the majority of their values do not meet the criteria, or if Krippendorff’s α indicates poor agreement
The paper presents a systematic approach for the analysis of EHR DQ failures through understanding data provenance, and documents the resulting improvements in DQ for secondary use
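
The two DQ measures named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example rule, and the toy values are assumptions made for illustration, and the α computation assumes nominal (categorical) data, which the text does not specify. The majority rule follows the highlight above; no cutoff for "poor agreement" is assumed because none is stated here.

```python
from collections import Counter
from itertools import permutations


def adherence_pct(values, rule):
    """Percentage of values in one field that satisfy `rule` (hypothetical helper)."""
    if not values:
        raise ValueError("field has no values")
    return 100.0 * sum(1 for v in values if rule(v)) / len(values)


def is_majority_failure(values, rule):
    """True when more than half of the field's values fail the rule."""
    return adherence_pct(values, rule) < 50.0


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the values recorded for
    one unit (for example, one admission) by the variables being compared.
    Use None for a missing value.
    """
    # Build the coincidence matrix: every ordered pair of values within a
    # unit contributes 1 / (m - 1), where m is the number of values present.
    coincidences = Counter()
    for unit in units:
        present = [v for v in unit if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit with fewer than two values cannot be paired
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and the overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable values")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy values only; the rule mirrors the "aged over 17 years" criterion.
    ages = [18, 25, 40, 16]
    print(adherence_pct(ages, lambda age: age > 17))
    print(is_majority_failure(ages, lambda age: age > 17))

    # Two hypothetical binary variables recorded for the same ten units.
    first = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
    second = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
    print(round(krippendorff_alpha_nominal(list(zip(first, second))), 3))
```

An α of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance; what counts as "poor" would need to be set by the analyst.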

Summary

Introduction

Secondary use of electronic health record (EHR) data requires evaluation of data quality (DQ) for fitness of use. The widespread adoption of EHRs has been followed by their adoption as a source of research data in multiple domains,[1,2,3,4] commonly termed secondary use. The validity and robustness of such secondary use depend on the quality of the underlying EHR data, and multiple DQ frameworks[5,6,7,8] have been articulated for this purpose. These frameworks provide assessment methods for analyzing EHR quality in terms of DQ dimensions.[9,10]

Objectives

Methods

Results

Discussion

Conclusion