Abstract

Background: Data quality assessment is important but complex and task-dependent. Identifying suitable measurement methods and reference ranges for assessing their results is challenging. Both manual inspection of measurement results and current data-driven approaches for learning which results indicate data quality issues have considerable limitations, e.g. in identifying task-dependent thresholds for measurement results that indicate data quality issues.

Objectives: To explore the applicability and potential benefits of a data-driven approach for learning task-dependent knowledge about suitable measurement methods and the assessment of their results. Such knowledge could help others determine whether a local data stock is suitable for a given task.

Methods: We started by creating artificial data with previously defined data quality issues and applied a set of generic measurement methods to this data (e.g. a method counting the number of values in a certain variable or computing the mean of its values). We trained decision trees on exported measurement methods' results and corresponding outcome data (data indicating the data's suitability for a use case). For evaluation, we derived rules for potential measurement methods and reference values from the decision trees and compared them regarding their coverage of the data quality issues artificially created in the dataset. Three researchers independently derived these rules, one with knowledge of the present data quality issues and two without.

Results: Our self-trained decision trees indicated rules for 12 of 19 previously defined data quality issues. The learned knowledge about measurement methods and their assessment was complementary to manual interpretation of measurement methods' results.

Conclusions: Our data-driven approach derives sensible knowledge for task-dependent data quality assessment and complements other current approaches. Based on labeled measurement methods' results as training data, it successfully suggested applicable rules for checking data quality characteristics that determine whether a dataset is suitable for a given task.
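
As a rough illustration of the pipeline described in the Methods, the sketch below trains a decision tree on measurement method (MM) results labeled with use-case suitability and reads candidate threshold rules off the tree's splits. It is not the authors' implementation: the feature names (value_count, mean_value), the synthetic data, and the suitability labels are assumptions made purely for illustration.

```python
# Minimal sketch (not the study's code): learning candidate DQA rules from
# labeled MM results with a decision tree. All data here is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
n = 200

# Hypothetical MM results per dataset: a value count and a mean value.
value_count = rng.integers(0, 500, size=n)
mean_value = rng.normal(50, 15, size=n)

# Hypothetical outcome label: the dataset is deemed suitable for the use case
# only if enough values are present and the mean lies in a plausible range.
suitable = ((value_count >= 100) & (mean_value > 30) & (mean_value < 70)).astype(int)

X = np.column_stack([value_count, mean_value])
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, suitable)

# The split conditions can be inspected as candidate rules and reference values,
# e.g. "value_count <= 99.5 -> not suitable".
print(export_text(tree, feature_names=["value_count", "mean_value"]))
```

In the study's setting, researchers would interpret such splits as proposed MMs plus reference values and compare them against the known, artificially injected data quality issues.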

Highlights

  • Data quality assessment is important but complex and task-dependent

  • Based on labeled measurement methods’ results as training data, our approach successfully suggested applicable rules for checking data quality characteristics that determine whether a dataset is suitable for a given task

  • The exported measurement method (MM) results used for machine learning can be found in the file “MM-results_export_for_machine_learning.csv”

Introduction

Data quality assessment (DQA) is important but complex and task-dependent. Identifying suitable measurement methods (MMs) and reference ranges for assessing their results is challenging. A common approach to identifying relevant MMs and reference ranges for a given purpose is to review the literature on DQA in similar situations, to study published DQA frameworks, and to interview experts (cf. [12,13,14,15,16]). Complementing this with data-driven methods that are less dependent on experts' opinions and that better support collaborative learning of DQA knowledge is desirable. Johnson et al. proposed a method to quantify the impact of data quality (DQ) in different variables on a given purpose, based on a linear regression fitted with MM results and outcome data [17]. Their method quantifies the task-dependent impact of MMs whose results are suitable for linear regression, but it does not address thresholds. We examine a new approach to deriving DQA knowledge from shared MM results and corresponding outcome data.
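
For context, the following is a minimal sketch of the kind of regression-based impact quantification attributed to Johnson et al. [17]; it is not their implementation, and the variable names and data are assumptions made for illustration. It shows why such a model can rank the task-dependent impact of MM results but yields no thresholds for them.

```python
# Rough sketch (illustrative only): MM results, e.g. completeness per variable,
# used as predictors of an outcome that reflects fitness for the task.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300

# Hypothetical per-dataset MM results: completeness of two variables (0..1).
completeness_age = rng.uniform(0.5, 1.0, size=n)
completeness_diagnosis = rng.uniform(0.5, 1.0, size=n)

# Hypothetical outcome, driven more strongly by diagnosis completeness.
outcome = 0.2 * completeness_age + 0.8 * completeness_diagnosis + rng.normal(0, 0.05, size=n)

X = np.column_stack([completeness_age, completeness_diagnosis])
model = LinearRegression().fit(X, outcome)

# Larger coefficients suggest MMs whose results matter more for this task,
# but the regression itself provides no cut-off values for those results.
for name, coef in zip(["completeness_age", "completeness_diagnosis"], model.coef_):
    print(f"{name}: impact coefficient = {coef:.2f}")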
