Abstract

We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. We capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, support new deployment architectures that decouple control logic from data storage, and reduce development and operations costs. We introduce the design pattern; explain how it leverages high-performance data enclaves and cloud-based data management services; review representative examples at research laboratories and universities, including both experimental facilities and supercomputer sites; describe how to leverage Python APIs for authentication, authorization, data transfer, and data sharing; and use coding examples to demonstrate how these APIs can be used to implement a range of research data portal capabilities. Sample code at a companion web site, https://docs.globus.org/mrdp, provides application skeletons that readers can adapt to realize their own research data portals.

Highlights

  • The need for scientists to exchange data has led to an explosion over recent decades in the number and variety of research data portals: systems that provide remote access to data repositories for such purposes as discovery and distribution of reference data, the upload of new data for analysis and/or integration, and data sharing for collaborative analysis

  • We review various deployments to show how the modern research data portal (MRDP) approach has been applied in practice: examples like the National Center for Atmospheric Research’s Research Data Archive, which provides for high-speed data delivery to thousands of geoscientists; the Sanger Imputation Service, which provides for online analysis of user-provided genomic data; the Globus data publication service, which provides for interactive data publication and discovery; and the DMagic data sharing system for data distribution from light sources

  • In all cases we report only successfully completed transfers. (Ongoing transfers, and transfers canceled by users, or by Globus due to errors, are not included.) These results highlight the large amounts of data that can be handled by such portals

Read more

Summary

INTRODUCTION

The need for scientists to exchange data has led to an explosion over recent decades in the number and variety of research data portals: systems that provide remote access to data repositories for such purposes as discovery and distribution of reference data, the upload of new data for analysis and/or integration, and data sharing for collaborative analysis. We introduce the MRDP design pattern and describe its realization via the integration of two elements: Science DMZs (Dart et al, 2013) (high-performance network enclaves that connect large-scale data servers directly to high-speed networks) and cloud-based data management and authentication services such as those provided by Globus (Chard, Tuecke & Foster, 2014). A single server handles request processing, data access, authentication, and other functions It is the simple and monolithic architecture that characterizes this legacy research data portal (LRDP) design pattern that has allowed its widespread application and its adaptation to many purposes.

A REFERENCE MRDP IMPLEMENTATION
EVALUATION OF MRDP ADOPTION
SUMMARY

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.