Abstract

As computational pipelines become a bigger part of science, it is important to ensure that the results are reproducible, a concern which has come to the fore in recent years. All developed software should be able to be run automatically without any user intervention. In addition to being valuable to the wider community, which may wish to reproduce or extend a published analysis, reproducible research practices allow for better control over the project by the original authors themselves. For example, keeping a non-executable record of parameters and command line arguments leads to error-prone analysis and opens up the possibility that, when the results are to be written up for publication, the researcher will no longer be able to completely describe the process that led to them. For large projects, the use of multiple computational cores (either in a multi-core machine or distributed across a compute cluster) is necessary to obtain results in a useful time frame. Furthermore, it is often the case that, as the project evolves, it becomes necessary to save intermediate results while downstream analyses are designed (or re-designed) and implemented. Under many frameworks, this makes it increasingly difficult to maintain a single point of entry for the computation. Jug is a software framework which addresses these issues by caching intermediate results and distributing the computational work as tasks across a network. Jug is written in Python without the use of compiled modules, is completely cross-platform, and available as free software under the liberal MIT license. Jug is available from: http://github.com/luispedro/jug.
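The coordination the abstract alludes to, several processes sharing the work without duplicating it, can be sketched with a standard trick: workers claim a task by atomically creating a lock file, so two workers racing on a shared filesystem can never both claim the same task. This is only an illustrative sketch of the general mechanism; the `try_lock` helper and file naming below are not Jug's actual implementation.

```python
import os
import tempfile

def try_lock(lockdir, name):
    """Atomically claim a task by creating a lock file.

    Returns True if this process got the claim. The O_EXCL flag makes
    the creation atomic: of two workers racing on the same task, exactly
    one succeeds and the other gets FileExistsError.
    """
    os.makedirs(lockdir, exist_ok=True)
    path = os.path.join(lockdir, name + '.lock')
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True

lockdir = tempfile.mkdtemp()
# Simulate three claim attempts; the repeated 't0' must be refused.
claimed = [n for n in ['t0', 't1', 't0'] if try_lock(lockdir, n)]
print(claimed)  # → ['t0', 't1']
```

The same idea extends across machines whenever the lock directory lives on a shared filesystem, which is why no central scheduler process is required.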

Highlights

  • The value of reproducible research in computational fields has been recognized in several areas, including fields as different as computational mathematics, signal processing [18, 59], neuronal network modeling [45], archeology [41], and climate science [20].

  • There is a range of medium-sized problems that can be successfully tackled on a computer cluster with a small number of nodes, or even on a single multicore machine.

Introduction

The value of reproducible research in computational fields has been recognized in several areas, including fields as different as computational mathematics, signal processing [18, 59], neuronal network modeling [45], archeology [41], and climate science [20].

When the code or its parameters change, the affected tasks and their descendants in the task DAG must be recomputed. Without tool support, this can be a very error-prone operation: by not removing all relevant intermediate files, it is easy to generate an irreproducible state where different blocks of the computation output were generated using different versions of the code.

The example starts with a simple Python function to parse the data directory structure and return a list of input files:

    from jug import TaskGenerator, CachedFunction
    import mahotas as mh
    # Note: in current scikit-learn, KFold lives in sklearn.model_selection;
    # sklearn.cross_validation was the module name at the time of writing.
    from sklearn.cross_validation import KFold

    def load():
        '''
        This function assumes that the images are stored in data/
        with filenames matching the pattern label-protein-([0-9]).tiff
        '''
        from os import listdir
        images = []
        base = './data/'
        for path in listdir(base):
            if 'protein' not in path:
                continue
            label = path.split('-')[0]
            # We only store paths and will load data on demand.
            # This saves memory.
            images.append((base + path, label))
        return images

Name: Zenodo
Persistent identifier: https://doi.org/10.5281/zenodo.847794
Licence: MIT
Publisher: Luis Pedro Coelho
Version published: 1.6.0
Date published: 24 Aug 2017
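The caching behaviour described above can be sketched in plain Python: a task's result is stored on disk under a hash of the function name and its arguments, so re-running the script skips any work whose inputs have not changed, and a change to the inputs produces a new hash, forcing recomputation. This is a deliberately simplified illustration, not Jug's actual implementation; the `Task` class, the cache layout, and the `double` function below are all invented for the sketch (Jug's real tasks also hash their dependencies and support remote backends).

```python
import hashlib
import os
import pickle
import tempfile

class Task:
    """Much-simplified sketch of a jug-style task: the result of
    f(*args) is cached on disk, keyed by a hash of the function name
    and its pickled arguments."""

    def __init__(self, f, *args):
        self.f = f
        self.args = args

    def hash(self):
        h = hashlib.sha1()
        h.update(self.f.__name__.encode())
        h.update(pickle.dumps(self.args))
        return h.hexdigest()

    def run(self, cachedir):
        path = os.path.join(cachedir, self.hash())
        if os.path.exists(path):
            # Result already computed: load it instead of recomputing.
            with open(path, 'rb') as fp:
                return pickle.load(fp)
        value = self.f(*self.args)
        with open(path, 'wb') as fp:
            pickle.dump(value, fp)
        return value

def double(x):
    return 2 * x

cachedir = tempfile.mkdtemp()
t = Task(double, 21)
print(t.run(cachedir))  # computes 42 and caches it
print(t.run(cachedir))  # loads the cached 42; double() is not called again
```

Because the key depends on the arguments, `Task(double, 22)` would hash differently and be computed fresh, which is exactly why editing a parameter invalidates a task and everything downstream of it in the DAG.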

