Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Ian Stokes-Rees, Piotr Sliz, Wei Yang, Ian Levesque, Ashley Deacon, Frank V Murphy

doi:10.1107/s0909049512009776

Abstract

Early-stage experimental data in structural biology is generally unmaintained and inaccessible to the public. It is increasingly believed that this data, which forms the basis for each macromolecular structure discovered by this field, must be archived and, in due course, published. Furthermore, the widespread use of shared scientific facilities such as synchrotron beamlines complicates data storage, access and movement, as does the growing number of remote users. This work describes a prototype system that adapts existing federated cyberinfrastructure technology and techniques to significantly improve the operational environment for users and administrators of synchrotron data collection facilities used in structural biology. This is achieved through software from the Virtual Data Toolkit and Globus, bringing together federated users and facilities from the Stanford Synchrotron Radiation Lightsource, the Advanced Photon Source, the Open Science Grid, the SBGrid Consortium and Harvard Medical School. The performance of, and experience with, the prototype provide a model for data management at shared scientific facilities.

Highlights

The field of structural biology provides atomic-scale models of macromolecules
The trial of the prototype system consisted of configuring Stanford Synchrotron Radiation Lightsource (SSRL) and Northeast Collaborative Access Team (NE-CAT) as Globus Online service (GO) endpoints, setting up the necessary X.509 authentication system, and mapping grid identities to user identities at the participating sites
Users requested grid accounts through the SBGrid Science Portal, which automatically registered them into the SBGrid virtual organizations (VOs), and created a proxy certificate with the National Center for Supercomputing Applications (NCSA) MyProxy server
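
The mapping of grid identities to user identities mentioned in the highlights can be illustrated with a minimal sketch. It assumes the mapping is kept in a Globus-style grid-mapfile, in which each line pairs a quoted X.509 certificate subject (DN) with one or more comma-separated local account names. The DNs, account names, and the helper functions parse_gridmap and local_user_for are hypothetical and chosen only for illustration; they are not taken from the prototype described here.

```python
import shlex

# Hypothetical grid-mapfile content. The DNs and account names are invented
# for illustration and do not come from the sites named in the paper.
EXAMPLE_GRIDMAP = '''
# certificate subject (DN)                        local account
"/DC=org/DC=example/OU=People/CN=Alice Example" alice
"/DC=org/DC=example/OU=People/CN=Bob Example" bob,bobgrid
'''


def parse_gridmap(text):
    """Return a dict mapping each certificate DN to its first local account."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # honours the quotes around the DN
        if len(parts) != 2:
            continue  # ignore lines that do not fit the assumed format
        dn, accounts = parts
        # Assumption: when several accounts are listed, the first is the default.
        mapping[dn] = accounts.split(",")[0]
    return mapping


def local_user_for(dn, mapping):
    """Look up the local account for a DN, or None if the DN is not mapped."""
    return mapping.get(dn)


if __name__ == "__main__":
    gridmap = parse_gridmap(EXAMPLE_GRIDMAP)
    print(local_user_for("/DC=org/DC=example/OU=People/CN=Alice Example", gridmap))
```

An unmapped DN returns None, which a site would normally treat as an authorization failure rather than falling back to a shared account.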

Summary

Introduction

The field of structural biology provides atomic-scale models of macromolecules. While these models are typically made public through the Protein Data Bank (PDB; Berman, 2000), the source experimental data used to establish the models is generally not published. Advances in the technology and automation at these shared facilities are producing higher data rates, with an anticipated need to process terabytes per day in the near future (Soltis et al., 2008). These challenges are similar to those faced by genomics research or high-energy physics: centralized data collection at a shared facility by a large group of users with independent affiliations and collaborations.

Objectives

Methods

Results

Conclusion