Abstract

The process of developing a digital collection in the context of a research project often involves a pipeline pattern during which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle, curators organize newly generated data while cleaning and integrating legacy data where it exists and deciding what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers concurrently. There is a need for data management solutions that can help curators run efficient, on-demand analyses of their collection so that they remain well informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow that leverages open science High Performance Computing (HPC) resources to conduct routine data management tasks on large collections efficiently. We demonstrate that HPC resources and techniques can significantly reduce the time required for critical data management tasks and enable dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails learning to use the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections.

Highlights

  • The curators at the Institute of Classical Archaeology (ICA) at the University of Texas at Austin needed resources for managing an evolving data collection (~4.3 TB in size at the time of writing) efficiently and frequently

  • We present an overview of the test collection, provide an introduction to High Performance Computing (HPC), and discuss the methods used in parallelizing the metadata extraction

  • As data collections grow in size (e.g., 4 TB and above), routine data management tasks, such as extracting metadata, calculating checksums, and enabling dynamic archiving, are difficult to perform in a desktop computing environment; a minimal sketch of this kind of task follows these highlights
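
As a rough illustration of the kind of task the highlights refer to, the sketch below walks a directory tree and records file size, extension, modification date, and an MD5 checksum for every file, serially, the way a curator might run it on a desktop machine. The paths, output fields, and CSV layout are illustrative assumptions, not the collection's actual tooling.

```python
# Minimal serial sketch (illustrative, not the collection's actual workflow):
# walk a directory tree and record basic metadata plus an MD5 checksum per file.
import csv
import hashlib
import os
import sys
import time


def md5sum(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so memory use stays flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main(root, out_csv="collection_metadata.csv"):
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "extension", "mtime", "md5"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    stat = os.stat(path)
                    writer.writerow([
                        path,
                        stat.st_size,
                        os.path.splitext(name)[1].lower(),
                        time.strftime("%Y-%m-%d", time.localtime(stat.st_mtime)),
                        md5sum(path),
                    ])
                except OSError as err:  # unreadable or vanished files
                    print(f"skipped {path}: {err}", file=sys.stderr)


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

On a multi-terabyte, many-million-file collection, a serial pass like this is dominated by checksum I/O, which is what motivates moving the work to HPC resources.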



Introduction

Routine data management tasks such as finding records, identifying data types and production dates, sorting through multiple copies, culling corrupted and redundant files, and reorganizing data have been conducted manually, placing a significant burden on research staff. To conduct these tasks more efficiently, the team started experimenting with powerful data analysis methods that exploit collection-level metadata related to the file system, file formats, file sizes, and checksums. The ICA staff applied for, and obtained, an ECSS-supported allocation through XSEDE (Charge No. TG-HUM130001), and with the help of HPC experts at TACC, developed familiarity with the HPC environment. Together they developed a metadata extraction workflow that can be used by data curators on HPC resources with minimal Linux training. We present an overview of the test collection, provide an introduction to HPC, and discuss the methods used in parallelizing the metadata extraction.
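
The paper's workflow targets HPC resources at TACC; as a simplified sketch of the parallelization idea only, the example below spreads per-file checksum and metadata extraction across worker processes on a single node using Python's multiprocessing module. The file-list convention, pool size, and output format are assumptions made for illustration; on a real HPC system the same per-file work would typically be distributed across many nodes through the site's batch scheduler or a task launcher.

```python
# Illustrative sketch only: parallel per-file metadata extraction on one node.
# A production HPC run would distribute batches of files across nodes, but the
# per-file work (stat + checksum) looks the same.
import csv
import hashlib
import os
import sys
from multiprocessing import Pool


def describe(path):
    """Return (path, size, extension, md5) for one file, or None on error."""
    try:
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return (path, os.path.getsize(path),
                os.path.splitext(path)[1].lower(), digest.hexdigest())
    except OSError:
        return None


def main(file_list, out_csv="metadata_parallel.csv", workers=None):
    # file_list is assumed to be a plain text file with one path per line,
    # e.g. produced ahead of time with: find /path/to/collection -type f
    with open(file_list) as fh:
        paths = [line.strip() for line in fh if line.strip()]

    with Pool(processes=workers) as pool, open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "extension", "md5"])
        # imap_unordered keeps all workers busy regardless of file sizes.
        for record in pool.imap_unordered(describe, paths, chunksize=64):
            if record is not None:
                writer.writerow(record)


if __name__ == "__main__":
    main(sys.argv[1])
```

The resulting per-file metadata table (types, sizes, dates, checksums) is what supports the collection-level analyses described above, such as spotting duplicates and corrupted files.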

A Complex Archaeological Data Collection
Findings
Conclusions and Future Work