Abstract

Web tables are rich sources of structured data for collection and analysis, but they suffer from various data quality problems such as missing or inconsistent values. Crowdsourcing offers a solution that leverages human cognitive ability to clean tables, but existing crowdsourcing-based solutions suffer from poor cleaning quality. Because most crowd workers are not experts, the difficulty of cleaning tasks seriously affects the cleaning result. To help people clean web tables effectively and efficiently, it is important to reduce the overall difficulty of tasks. In this paper, we introduce CrowdDA, a difficulty-aware crowdsourcing task optimization system that recommends the best task execution order, from easy to difficult, to the crowd and supports various kinds of cleaning tasks for web tables. CrowdDA takes both latency and space constraints into account and generates the task execution order that minimizes the overall difficulty of tasks under these two constraints. Furthermore, CrowdDA adopts partition strategies for large tables to improve system efficiency, and introduces independent task sequences to tolerate inconsistent crowd answers, which improves system robustness. Experiments on real-world datasets demonstrate the superior performance of CrowdDA in improving the cleaning quality of web tables.
