Abstract

Data curation processes constitute a number of activities, such as transforming, filtering or de-duplicating data. These processes consume an excessive amount of time in data science projects, due to datasets often being external, re-purposed and generally not ready for analytics. Overall, data curation processes are difficult to automate and require human input, which results in a lack of repeatability and potential errors propagating into analytical results. In this paper, we explore a crowd intelligence-based approach to building robust data curation processes. We study how data workers engage with data curation activities, specifically related to data quality detection, and how to build a robust and effective data curation process by learning from the wisdom of the crowd. With the help of a purpose-designed data curation platform based on iPython Notebook, we conducted a lab experiment with data workers and collected a multi-modal dataset that includes measures of task performance and behaviour data. Our findings identify avenues by which effective data curation processes can be built through crowd intelligence.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call