Abstract

Reproducibility is an essential component of the scientific method. To validate the correctness of a computational result, or to facilitate its extension, it should be possible to re-run a published experiment and verify that the same results are produced. However, reproducing a computational result is surprisingly difficult: non-determinism and other factors may make it impossible to obtain the same result, even when running the same code on the same machine on the same day. We explore this problem in the context of HEP codes and data, presenting three high-level methods for dealing with non-determinism in general: 1) domain-specific methods; 2) domain-specific comparisons; and 3) virtualization adjustments. Using a CMS workflow with output data stored in ROOT files, we apply these methods to prevent, detect, and eliminate some sources of non-determinism. We observe improved determinism using pre-determined random seeds, a predictable progression of system timestamps, and fixed process identifiers. Unfortunately, sources of non-determinism continue to exist despite the combination of all three methods. Hierarchical data comparisons also allow us to appropriately ignore some non-determinism when it is unavoidable. We conclude that there is still room for improvement, and identify directions that can be taken in each method to make an experiment more reproducible.
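The pre-determined random seed technique mentioned above can be illustrated with a minimal sketch. This is not the paper's CMS workflow; the `simulate` function below is a hypothetical stand-in for any stochastic computation, showing only the general principle that seeding the generator explicitly makes repeated runs bit-identical:

```python
import random

def simulate(seed, n=5):
    """Toy stand-in for a stochastic workflow step (hypothetical,
    not the CMS code from the paper). Using an explicitly seeded
    generator, rather than global or time-based seeding, makes
    every run reproduce the same pseudo-random sequence."""
    rng = random.Random(seed)  # pre-determined seed, one source of
                               # non-determinism removed
    return [rng.random() for _ in range(n)]

# Two runs with the same seed produce identical output; runs with
# different seeds generally do not.
assert simulate(42) == simulate(42)
assert simulate(42) != simulate(7)
```

The same principle extends to the other sources the abstract names: a workflow becomes more reproducible when timestamps and process identifiers are likewise fixed or virtualized rather than taken from the live environment.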
