Crowdsourcing chart digitizer: task design and quality control for making legacy open data machine-readable

Satoshi Oyama, Ikki Ohmukai, Hiroaki Dokoshi, Hisashi Kashima, Yukino Baba

doi:10.1007/s41060-016-0025-y

Abstract

Despite recent open data initiatives in many countries, a significant percentage of the data provided is in non-machine-readable formats such as images rather than in machine-readable electronic formats, which restricts its usability. Various types of software for digitizing data chart images have been developed. However, such software is designed for manual use and thus requires human intervention, making it unsuitable for automatically extracting data from a large number of chart images. This paper describes the first unified framework for converting legacy open data in chart images into a machine-readable and reusable format by using crowdsourcing. Crowd workers are asked not only to extract data from an image of a chart but also to reproduce the chart objects in a spreadsheet. The properties of the reproduced chart objects give their data structures, including series names and values, which are useful for automatic processing of the data by computer. Since results produced by crowdsourcing inherently contain errors, a quality control mechanism was developed that improves accuracy by aggregating tables created by different workers for the same chart image and by utilizing the data structures obtained from the reproduced chart objects. Experimental results demonstrated that the proposed framework and mechanism are effective. The proposed framework is not intended to compete with chart digitizing software, and workers can use such software if they find it useful for extracting data from charts. Experiments in which workers were encouraged to use such software showed that the extracted data still contained errors even when workers used it. This indicates that quality control is necessary even if workers use software to extract data from chart images.
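The idea of aggregating tables created by different workers can be illustrated with a minimal sketch. It is not the mechanism developed in the paper: it assumes that the worker tables have already been aligned to the same rows and columns, and it uses a majority vote for headers and a cell-wise median for values as stand-in aggregation rules. The function names and the sample data are hypothetical.

```python
from collections import Counter
from statistics import median


def aggregate_headers(headers_per_worker):
    """Majority vote per header position (assumed rule, not the paper's)."""
    aggregated = []
    for candidates in zip(*headers_per_worker):
        # Strip stray whitespace so trivially different spellings agree.
        normalized = [c.strip() for c in candidates]
        aggregated.append(Counter(normalized).most_common(1)[0][0])
    return aggregated


def aggregate_values(tables):
    """Cell-wise median over already-aligned, same-shaped numeric tables."""
    n_rows = len(tables[0])
    n_cols = len(tables[0][0])
    return [
        [median(t[r][c] for t in tables) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical output of three workers for the same chart image.
worker_headers = [["Year", "Sales"], ["Year", "Sales "], ["Year", "Sale"]]
worker_tables = [
    [[2001, 10.0], [2002, 12.5]],
    [[2001, 10.0], [2002, 21.5]],  # one worker transposed two digits
    [[2001, 10.2], [2002, 12.5]],
]

headers = aggregate_headers(worker_headers)
values = aggregate_values(worker_tables)
print(headers)
print(values)
```

The median is chosen here only because a single outlying entry does not shift it; any robust aggregate would serve the same illustrative purpose.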

Highlights

The most prominent of the recent open data initiatives to publish various kinds of data in electronic format are those for statistical data gathered by governmental agencies [2].
To the best of our knowledge, this paper presents the first unified framework for converting legacy open data in chart images into a machine-readable, reusable format by using crowdsourcing.
Ermilov et al. [15] proposed a formalization of tabular data as well as its mapping and transformation to Resource Description Framework (RDF), which enables the crowdsourcing of large-scale semantic mapping of tabular data.

Summary

Introduction

The most prominent of the recent open data initiatives to publish various kinds of data in electronic format are those for statistical data gathered by governmental agencies [2]. There has also been demand within the scientific community for extracting values from statistical charts, typically for reusing data published in old papers. To meet this demand, various types of chart digitizing software such as WebPlotDigitizer and DataThief have been developed. We have taken a human computation approach to the datafication of legacy data: using crowdsourcing to extract structured data from charts in legacy file formats such as image and PDF files. Doing this will improve the ranking of such data from one star in Berners-Lee’s scheme to two or three stars.

Related work

Framework for digitizing chart images using crowdsourcing

Structured data extraction through visualization

Feasibility of our crowdsourcing framework

Quality control mechanism

Alignment of rows and columns

Aggregating table headers and cell values

Dataset and software

Accuracy of worker tables

Accuracies of aggregated tables

Analysis of results when workers used chart digitizing software

Example of chart digitizing software

Experimental settings

Task design

Table aggregation

Integrating chart digitizing software in the framework

Converting tables into RDF format

Findings

Journal: International Journal of Data Science and Analytics
Publication Date: Oct 1, 2016
Citations: 4
License type: CC BY
