Abstract

Data preparation has become a necessary but labor- and resource-intensive step in data analytics. To date, it still requires considerable manual effort from experts. In this paper, we focus on one specific data preparation activity, namely data quality discovery. We explore the different settings in which data workers undertake data quality discovery tasks and the implications of those settings for the workers' efficiency and effectiveness. To this end, we propose DataOps-4G, a data curation platform for generalists that allows users to interact with data without writing code. We wrap predefined code snippets that implement useful functions for exploring data quality and bundle them into so-called DataOps. We then conduct a lab-based user study to evaluate the DataOps-4G platform from two perspectives: (i) effectiveness, the accuracy of the outcomes achieved by participants; and (ii) efficiency, the effort participants expend and the strategies they use to complete tasks. Our experimental results show how participants' task completion patterns and strategies can affect both effectiveness and efficiency. This opens up the possibility of popularizing data curation processes by employing non-experts (e.g., from crowdsourcing platforms), consequently allowing experts to focus on more complex activities (e.g., building machine learning models).
