Cost-based analysis of the impact of data completeness and representational consistency

Yoram Timmerman, Rihem Nasfi, Guy De Tré, Filip Pattyn, Antoon Bronselaer

doi:10.1016/j.dss.2023.114044

Abstract

Data quality is an important topic for businesses and therefore requires appropriate analysis tools. Although several rule-based systems exist today for quality measurement, their results do not always reflect the real impact of quality issues on practical data usability and are therefore not well suited as a basis for economic decisions. This work implements and evaluates an alternative, cost-based approach to data quality analysis, starting from a ‘fitness for use’ perspective. The practical impact of completeness and representational consistency of data stored in an integrated relational database is investigated in an experiment with 218 volunteers. Two alternative versions of this database are prepared by manually improving their data quality. Participants are randomly assigned to one of the three databases and are given a set of questions to resolve by means of SQL. As questions are resolved, we measure several cost-based indicators, such as ability to solve, time to solve, and number of attempts. Results indicate that the impact of data quality issues can differ significantly from what would be expected when using rule-based measurement. Effects range from almost no impact to a 65% reduction in the time needed to solve tasks. One-way ANCOVA tests show effect sizes of up to 0.43.

Full Text