Abstract

The process of developing a digital collection in the context of a research project often involves a pipeline pattern during which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle, curators organize newly generated data while cleaning and integrating legacy data where it exists and deciding what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers concurrently. There is a need for data management solutions that can help curators run efficient, on-demand analyses of their collection so that they remain well informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow that leverages open science High Performance Computing (HPC) resources to conduct routine data management tasks on large collections efficiently. We demonstrate that HPC resources and techniques can significantly reduce the time required for critical data management tasks and enable dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails learning to use the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections.

Highlights

  • The curators at the Institute of Classical Archaeology (ICA) at the University of Texas at Austin needed resources for managing an evolving data collection (~4.3 TB in size at the time of writing) efficiently and frequently

  • We present an overview of the test collection, provide an introduction to High Performance Computing (HPC), and discuss the methods used in parallelizing the metadata extraction

  • As data collections grow in size (e.g., 4 TB and above), routine data management tasks, such as extracting metadata, calculating checksums, and enabling dynamic archiving, are difficult to perform in a desktop computing environment; a minimal sketch of this kind of task follows these highlights
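
As a rough illustration of the kind of task the highlights refer to, the sketch below walks a directory tree and records file size, extension, modification date, and an MD5 checksum for every file, serially, the way a curator might run it on a desktop machine. The paths, output fields, and CSV layout are illustrative assumptions, not the collection's actual tooling.

```python
# Minimal serial sketch (illustrative, not the collection's actual workflow):
# walk a directory tree and record basic metadata plus an MD5 checksum per file.
import csv
import hashlib
import os
import sys
import time


def md5sum(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so memory use stays flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main(root, out_csv="collection_metadata.csv"):
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "extension", "mtime", "md5"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    stat = os.stat(path)
                    writer.writerow([
                        path,
                        stat.st_size,
                        os.path.splitext(name)[1].lower(),
                        time.strftime("%Y-%m-%d", time.localtime(stat.st_mtime)),
                        md5sum(path),
                    ])
                except OSError as err:  # unreadable or vanished files
                    print(f"skipped {path}: {err}", file=sys.stderr)


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

On a multi-terabyte, many-million-file collection, a serial pass like this is dominated by checksum I/O, which is what motivates moving the work to HPC resources.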



Introduction

Routine data management tasks such as finding records, identifying data types and production dates, sorting through multiple copies, culling corrupted and redundant files, and reorganizing data have been conducted manually, placing a significant burden on research staff. To conduct these tasks more efficiently, the team started experimenting with powerful data analysis methods that exploit collection-level metadata related to the file system, file formats, file sizes, and checksums. The ICA staff applied for, and obtained, an ECSS-supported allocation through XSEDE (Charge No. TG-HUM130001), and with the help of HPC experts at TACC, developed familiarity with the HPC environment. Together they developed a metadata extraction workflow that can be used by data curators on HPC resources with minimal Linux training. We present an overview of the test collection, provide an introduction to HPC, and discuss the methods used in parallelizing the metadata extraction.
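
The paper's workflow targets HPC resources at TACC; as a simplified sketch of the parallelization idea only, the example below spreads per-file checksum and metadata extraction across worker processes on a single node using Python's multiprocessing module. The file-list convention, pool size, and output format are assumptions made for illustration; on a real HPC system the same per-file work would typically be distributed across many nodes through the site's batch scheduler or a task launcher.

```python
# Illustrative sketch only: parallel per-file metadata extraction on one node.
# A production HPC run would distribute batches of files across nodes, but the
# per-file work (stat + checksum) looks the same.
import csv
import hashlib
import os
import sys
from multiprocessing import Pool


def describe(path):
    """Return (path, size, extension, md5) for one file, or None on error."""
    try:
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return (path, os.path.getsize(path),
                os.path.splitext(path)[1].lower(), digest.hexdigest())
    except OSError:
        return None


def main(file_list, out_csv="metadata_parallel.csv", workers=None):
    # file_list is assumed to be a plain text file with one path per line,
    # e.g. produced ahead of time with: find /path/to/collection -type f
    with open(file_list) as fh:
        paths = [line.strip() for line in fh if line.strip()]

    with Pool(processes=workers) as pool, open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "extension", "md5"])
        # imap_unordered keeps all workers busy regardless of file sizes.
        for record in pool.imap_unordered(describe, paths, chunksize=64):
            if record is not None:
                writer.writerow(record)


if __name__ == "__main__":
    main(sys.argv[1])
```

The resulting per-file metadata table (types, sizes, dates, checksums) is what supports the collection-level analyses described above, such as spotting duplicates and corrupted files.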

A Complex Archaeological Data Collection
Findings
Conclusions and Future Work