Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool

Allis J Choi, Xuying Xin

doi:10.7191/jeslib.2021.1209

Abstract

Data curation is the process of managing data so that it remains available for reuse and preservation and supports FAIR (findable, accessible, interoperable, reusable) use. It is an important part of the research lifecycle, as researchers are often required by funders, or more generally encouraged, to preserve their datasets and make them discoverable and reusable. This has become especially important as Open Access (OA) policies are implemented at many institutions across the nation. An efficient data repository and its data curation play key roles in facilitating research data discovery and reuse. In this article, we briefly discuss the local institutional repository at Penn State University and the general data curation practices we adopt for deposited files and datasets, and then focus on a data analytics tool that has recently been applied to extract tabular data from PDF files. This enhances the existing data curation practices by adding tabular data to deposits whose PDF files contain embedded tables that are not easily reused.

Highlights

Launched in 2012, our institutional repository, ScholarSphere, enables University faculty, students and staff to deposit and actively manage their scholarly works, and to share them with the university community and the world. The local data curation team adopts general data curation practices, the CURATE(D) steps (Check, Understand, Request, Augment, Transform, Evaluate, Document), to curate files and datasets in various research areas (STEM, Liberal Arts, Social Sciences) and in various formats, including tabular data, image data, software code, etc.
We partner with the Data Curation Network (DCN), a network of over ten institutions that shares data curation expertise while providing normalized data curation practices and professional development training (Johnston et al. 2018).
The process starts with finding deposits that contain PDF files, such as research papers, articles and reports, in the local repository (Figure 2) by using the “Work Type” search, then downloading and saving the PDF files to a local folder that the software tool can access for data extraction.
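As an illustration of the last step, the following is a minimal sketch of how the PDF files saved to a local folder could be listed before the extraction tool is pointed at them. It is not part of the workflow described in this article, which uses Microsoft Power BI Desktop. The function name `list_pdf_deposits` and the folder name `downloads/scholarsphere_pdfs` are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def list_pdf_deposits(folder):
    """Return the PDF files saved directly under `folder`, sorted by path.

    `folder` is a hypothetical local directory holding the PDF files
    downloaded from the repository.
    """
    root = Path(folder)
    if not root.is_dir():
        # Nothing has been downloaded yet, so there is nothing to list.
        return []
    # Compare the extension case-insensitively so "Report.PDF" is not missed.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # Hypothetical folder name; replace it with the actual download location.
    for pdf in list_pdf_deposits("downloads/scholarsphere_pdfs"):
        print(pdf.name)
```

A listing of this kind can serve as a quick check that every downloaded deposit is present in the folder before data extraction begins.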

Summary

Introduction

Launched in 2012, our institutional repository, ScholarSphere, enables University faculty, students and staff to deposit and actively manage their scholarly works, and to share them with the university community and the world. The local data curation team adopts general data curation practices, the CURATE(D) steps (Check, Understand, Request, Augment, Transform, Evaluate, Document), to curate files and datasets in various research areas (STEM, Liberal Arts, Social Sciences) and in various formats, including tabular data, image data, software code, etc. Microsoft Power BI Desktop, one of the top data analytics tools we have been using for visualization, has found an additional use in data curation.

Results

Conclusion