Abstract

Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term. A set of criteria is introduced to address this problem: completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free and open-source software. As a proof of concept, we introduce “Maneage” (managing data lineage), which enables cheap archiving, provenance extraction, and peer verification, and which has been tested in several research publications. We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. The caveats (with proposed solutions) are then discussed, and we conclude with the benefits for the various stakeholders. This article is itself a Maneage'd project (project commit 313db0b).

Appendices: Two comprehensive appendices that review the longevity of existing solutions are provided as supplementary “Web extras,” available in the IEEE Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/MCSE.2021.3072860.

Reproducibility: All products are available in zenodo.4913277. The Git history of this paper's source is at git.maneage.org/paper-concept.git, which is also archived in Software Heritage: swh:1:dir:33fea87068c1612daf011f161b97787b9a0df39f. Clicking on the SWHIDs in the digital format will provide more “context” for the same content.

Highlights

  • Reproducible research has been discussed in the sciences for at least 30 years [1], [2]

  • Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree, they need to understand the technology of their tools because it determines their results and interpretations

  • We have shown that it is possible to build workflows satisfying all the proposed criteria


Summary

INTRODUCTION

Reproducible research has been discussed in the sciences for at least 30 years [1], [2]. Many reproducible workflow solutions (hereafter, “solutions”) have been proposed, and they mostly rely on the common technology of the day: Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly Python during the last decade. Scientific projects, in particular, suffer the most from this technological turnover: scientists have to focus on their own research domain, but to some degree they also need to understand the technology of their tools, because it determines their results and interpretations. Scientists are still held accountable for their results, and the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.

LONGEVITY OF EXISTING TOOLS
PROPOSED CRITERIA FOR LONGEVITY
PROOF OF CONCEPT
DISCUSSION