Signac: A Python framework for data and workflow management

Vyas Ramasubramani, Sharon Glotzer, Paul Dodd, Bradley Dice, Carl Adorf

doi:10.25080/majora-4af1f417-016

Abstract

Computational research requires versatile data and workflow management tools that can easily adapt to the highly dynamic requirements of scientific investigations. Many existing tools require strict adherence to a particular usage pattern, so researchers often use less robust ad hoc solutions that they find easier to adopt. The resulting data fragmentation and methodological incompatibilities significantly impede research. Our talk showcases signac, an open-source Python framework that offers highly modular and scalable solutions for this problem. Named for the Pointillist painter Paul Signac, the framework's powerful workflow management tools enable users to construct and automate workflows that transition seamlessly from laptops to HPC clusters. Crucially, the underlying data model is completely independent of the workflow. The flexible, serverless, and schema-free signac database can be introduced into other workflows with essentially no overhead and no recourse to the signac workflow model. Additionally, the data model's simplicity makes it easy to parse the underlying data without using signac at all. This modularity and simplicity eliminate significant barriers to consistent data management across projects, facilitating improved provenance management and data sharing with minimal overhead.

Highlights

Streamlining data generation and analysis is a critical challenge for science in the age of big data and high performance computing (HPC)
The highly file-based workflows characteristic of computational science are not amenable to traditional relational databases, and HPC applications require that data be available on demand, enforcing strict performance requirements for any data storage mechanism
Building processes acting on this data requires transparent interaction with HPC clusters without sacrificing testability on personal computers, and these processes must be sufficiently malleable to adapt to changes in scientific inquiries

Summary

Introduction

Streamlining data generation and analysis is a critical challenge for science in the age of big data and high performance computing (HPC). In the context of signac-flow, individual operations are the nodes of a graph, and the pre- or post-conditions associated with each operation determine the edges. To simplify running such workflows, by default the project.py run interface demonstrated in Fig. 3 will automatically run the entire workflow for every job in the workspace. In addition to the core index-related functionality previously mentioned, the signac Project encapsulates numerous additional features, including, for example, the generation of human-readable views of the hash-obfuscated workspace; the ability to move, copy, or clone a full project; the ability to synchronize data across projects; and the detection of implicit schema. We qualify these schema as implicit because they are only defined by the state points of jobs within the workspace, i.e., there is nothing like a table schema to enforce a particular structure for the state points of individual jobs. These tools are orthogonal to signac and may be used in conjunction with it.
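
The two ideas above (conditions that link operations into a graph, and a schema that is implied only by the state points present) can be illustrated with a small standard-library sketch. This is not the signac or signac-flow API. The job dictionaries, the operation names, the condition labels, and the helper functions `workflow_edges`, `eligible_operations`, and `implicit_schema` are all hypothetical, and the sketch assumes each condition can be reduced to a simple label.

```python
from collections import defaultdict

# Hypothetical stand-ins for jobs: each job is a state point (a dict of
# parameters) plus a set of labels recording which results already exist.
jobs = [
    {"statepoint": {"T": 1.0, "N": 100}, "done": set()},
    {"statepoint": {"T": 2.0, "N": 100}, "done": {"initialized"}},
    {"statepoint": {"T": 2.0, "N": 200, "seed": 42},
     "done": {"initialized", "simulated"}},
]

# Each operation lists the labels it requires (pre-conditions) and the label
# it produces (post-condition). All names are illustrative only.
operations = {
    "initialize": {"pre": set(), "post": "initialized"},
    "simulate": {"pre": {"initialized"}, "post": "simulated"},
    "analyze": {"pre": {"simulated"}, "post": "analyzed"},
}


def workflow_edges(operations):
    """Link A -> B whenever A's post-condition is a pre-condition of B."""
    return sorted(
        (a, b)
        for a, spec_a in operations.items()
        for b, spec_b in operations.items()
        if spec_a["post"] in spec_b["pre"]
    )


def eligible_operations(job, operations):
    """Eligible: all pre-conditions hold and the post-condition does not."""
    return [
        name
        for name, spec in operations.items()
        if spec["pre"] <= job["done"] and spec["post"] not in job["done"]
    ]


def implicit_schema(jobs):
    """Collect, per state point key, the (type name, value) pairs observed."""
    schema = defaultdict(set)
    for job in jobs:
        for key, value in job["statepoint"].items():
            schema[key].add((type(value).__name__, value))
    return dict(schema)


edges = workflow_edges(operations)
next_ops = [eligible_operations(job, operations) for job in jobs]
schema = implicit_schema(jobs)
```

Nothing in the sketch declares which keys a state point must contain: the `seed` key appears in the schema only because one job happens to define it, which is the sense in which such a schema is implicit rather than enforced.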

Workflow and Provenance Management

Data Management

Conclusions