Abstract

Each year vast international resources are wasted on irreproducible research. The scientific community has been slow to adopt standard software engineering practices, despite the growth of high-dimensional data and the increasing complexity of workflows and computational environments. Here we show how scientific software applications can be created in a reproducible manner when simple design goals for reproducibility are met. We describe the implementation of a test server framework and 40 scientific benchmarks, covering numerous applications in Rosetta bio-macromolecular modeling. Integration with a high-performance computing cluster allows these benchmarks to run continuously and automatically. Detailed protocol captures are useful for developers and users of Rosetta and other macromolecular modeling tools. The framework and design concepts presented here are valuable for developers and users of any type of scientific software, and for the scientific community seeking to create reproducible methods. Specific examples highlight the utility of this framework, and the comprehensive documentation illustrates the ease of adding new tests in a matter of hours.

Highlights

  • We present the general setup of this framework, demonstrate how we address each of the challenges above, and report the results of the individual benchmarks, complete with detailed protocol captures, in the Supplementary Information of this paper.

  • Software testing is an essential part of this strategy, which ties into scientific reproducibility.

  • Running scientific benchmarks requires extensive CPU time; we chose to integrate them with our own custom-built test server framework connected to a dedicated high-performance computing (HPC) cluster (Fig. 1A and Supplementary Information); see the sketch after this list.
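To make that pattern concrete, below is a minimal, hypothetical sketch of how a benchmark could be declared and handed off to an HPC scheduler. It is not the paper's actual framework: the `BenchmarkTest` class, the `submit_slurm` helper, and the example Rosetta command line are illustrative assumptions, and the sketch presumes a SLURM-managed cluster where `sbatch` is available.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class BenchmarkTest:
    """One scientific benchmark: a command to run plus the resources it needs."""
    name: str
    command: list[str]  # e.g. a Rosetta protocol invocation (illustrative)
    cpus: int = 16
    hours: int = 24


def submit_slurm(test: BenchmarkTest) -> str:
    """Submit a benchmark to a SLURM-managed cluster; return the job id."""
    script = "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={test.name}",
        f"#SBATCH --cpus-per-task={test.cpus}",
        f"#SBATCH --time={test.hours}:00:00",
        " ".join(test.command),
    ])
    # sbatch reads the job script from stdin; --parsable prints only the job id.
    result = subprocess.run(
        ["sbatch", "--parsable"],
        input=script, text=True, capture_output=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical benchmark; the flags mirror a typical RosettaScripts call.
    test = BenchmarkTest(
        name="docking_benchmark",
        command=["rosetta_scripts", "-parser:protocol", "dock.xml"],
    )
    print("submitted job", submit_slurm(test))
```

In a continuously running setup such as the one described here, a driver process would loop over the registered benchmarks, submit each one, and collect the results once the jobs finish.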


Introduction

Each year vast international resources are wasted on irreproducible research. The scientific community has been slow to adopt standard software engineering practices, despite the growth of high-dimensional data and the increasing complexity of workflows and computational environments. In addition to poorly controlled computing environments, computational methods have become increasingly complex pipelines of data handling and processing. This effect is further compounded by the explosion of input data through “big data” efforts and exacerbated by a lack of stable, maintained, tested, and well-documented software, creating a huge gap between the theoretical limit for scientific reproducibility and the current reality[3]. These circumstances are often caused by a lack of best practices in software engineering or computer science[4,5], errors in laboratory management during project or personnel transitions, and a lack of academic incentives for software stability, maintenance, and longevity[6]. Barriers are also created through intellectual property agreements, competition, and refusal to share inputs, methods, and detailed protocols.
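The "poorly controlled computing environment" problem above is one that the protocol captures mentioned in the abstract are meant to address. Purely as an illustration, and not as the paper's implementation, the following sketch records a run's computing environment to a JSON file; the function name `capture_environment` and the output filename are assumptions, and the sketch presumes `pip` and a Git checkout are available.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def capture_environment(outfile: str = "protocol_capture.json") -> None:
    """Write the facts needed to rerun a computation: interpreter,
    operating system, installed packages, and the exact code revision."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        # Pinned package list, one "name==version" entry per line.
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines(),
        # Commit hash of the code that produced the results.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip(),
    }
    with open(outfile, "w") as fh:
        json.dump(record, fh, indent=2)


if __name__ == "__main__":
    capture_environment()
```

Archiving such a record next to every result makes it possible to trace a discrepancy back to a change in the code or the environment rather than in the science itself.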
