The Reproducible Data Reuse (ReDaR) Framework to Capture and Assess Multiple Data Streams

Donald A Keefer, Catherine L Blake

doi:10.1002/pra2.451

Abstract

Much of the literature in knowledge discovery from data (KDD) focuses on algorithms that are faster and more accurate at capturing patterns in a given data set. However, answering a research question is fundamentally connected with how well the data is aligned with the questions being asked. Thus, data selection is one of the most important steps to ensure that models produced from the KDD process are useful in practice. A lack of documentation about the data selection rationale and the transformations needed to semantically align the data streams prevents others from reproducing the research and obfuscates development of best practices in data integration. Our goal in this paper is to provide KDD practitioners with a framework that brings together theories in provenance, information quality, and contextual reasoning, to enable researchers to achieve a semantically aligned dataset with data selection, description, and documentation based on an application-focused assessment.

Full Text