Abstract

Despite recent open data initiatives in many countries, a significant percentage of the data provided is published in non-machine-readable formats such as images rather than in machine-readable electronic formats, which restricts its usability. Various types of software for digitizing data chart images have been developed. However, such software is designed for manual use and thus requires human intervention, making it unsuitable for automatically extracting data from a large number of chart images. This paper describes the first unified framework for converting legacy open data in chart images into a machine-readable and reusable format by using crowdsourcing. Crowd workers are asked not only to extract data from an image of a chart but also to reproduce the chart objects in a spreadsheet. The properties of the reproduced chart objects give their data structures, including series names and values, which are useful for automatic processing of the data by computer. Since results produced by crowdsourcing inherently contain errors, a quality control mechanism was developed that improves accuracy by aggregating tables created by different workers for the same chart image and by utilizing the data structures obtained from the reproduced chart objects. Experimental results demonstrated that the proposed framework and mechanism are effective. The proposed framework is not intended to compete with chart digitizing software; workers can use such software if they find it useful for extracting data from charts. Experiments in which workers were encouraged to use such software showed that the extracted data still contained errors even when workers used it. This indicates that quality control is necessary even if workers use software to extract data from chart images.
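
The idea of aggregating tables created by different workers for the same chart image can be illustrated with a minimal sketch. It is not the paper's mechanism: it assumes the worker tables are already aligned (same number of rows and columns, in the same order), and it swaps in two simple rules, a majority vote for header strings and a median for numeric cells. The function name `aggregate_tables` and the dictionary layout with `headers` and `rows` keys are hypothetical, chosen only for this example.

```python
from collections import Counter
from statistics import median


def aggregate_tables(tables):
    """Combine already-aligned worker tables into one table.

    Assumption (not from the paper): every table has the same shape,
    with string headers and numeric cell values.
    """
    # Majority vote per column header across workers.
    headers = [
        Counter(column_names).most_common(1)[0][0]
        for column_names in zip(*(t["headers"] for t in tables))
    ]
    # Median per cell, which tolerates a minority of outlier values.
    n_rows = len(tables[0]["rows"])
    rows = [
        [median(t["rows"][r][c] for t in tables) for c in range(len(headers))]
        for r in range(n_rows)
    ]
    return {"headers": headers, "rows": rows}


# Three hypothetical workers transcribing the same chart.
worker_tables = [
    {"headers": ["Year", "Sales"], "rows": [[2010, 12.0], [2011, 15.0]]},
    {"headers": ["Year", "Sales"], "rows": [[2010, 12.5], [2011, 15.0]]},
    {"headers": ["Year", "Sale"], "rows": [[2010, 120.0], [2011, 14.0]]},
]
aggregated = aggregate_tables(worker_tables)
```

In this toy input, the third worker's misspelled header and misplaced decimal point are both outvoted by the other two workers. A real pipeline would first have to align rows and columns across workers, which this sketch leaves out.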

Highlights

  • The most prominent of the recent open data initiatives to publish various kinds of data in electronic format are those for statistical data gathered by governmental agencies [2]

  • To the best of our knowledge, this paper presents the first unified framework for converting legacy open data in chart images into a machine-readable, reusable format by using crowdsourcing

  • Ermilov et al. [15] proposed a formalization of tabular data as well as its mapping and transformation to Resource Description Framework (RDF), which enables the crowdsourcing of large-scale semantic mapping of tabular data

Summary

Introduction

The most prominent of the recent open data initiatives to publish various kinds of data in electronic format are those for statistical data gathered by governmental agencies [2]. There has also been demand within the scientific community for extracting values from statistical charts, typically to reuse data published in old papers. To meet this demand, various types of chart digitizing software such as WebPlotDigitizer and DataThief have been developed. We have taken a human computation approach to the datafication of legacy data: crowdsourcing is used to extract structured data from charts in legacy file formats such as image and PDF files. Doing this will improve the ranking of such data from one star in Berners-Lee’s scheme to two or three stars.

Related work
Framework for digitizing chart images using crowdsourcing
Structured data extraction through visualization
Feasibility of our crowdsourcing framework
Quality control mechanism
Alignment of rows and columns
Aggregating table headers and cell values
Dataset and software
Accuracy of worker tables
Accuracies of aggregated tables
Analysis of results when workers used chart digitizing software
Example of chart digitizing software
Experimental settings
Task design
Table aggregation
Integrating chart digitizing software in the framework
Converting tables into RDF format
Findings