Abstract

As computational pipelines become a bigger part of science, it is important to ensure that the results are reproducible, a concern which has come to the fore in recent years. All developed software should be able to be run automatically without any user intervention. In addition to being valuable to the wider community, which may wish to reproduce or extend a published analysis, reproducible research practices allow for better control over the project by the original authors themselves. For example, keeping a non-executable record of parameters and command line arguments leads to error-prone analysis and opens up the possibility that, when the results are to be written up for publication, the researcher will no longer be able to completely describe the process that led to them. For large projects, the use of multiple computational cores (either in a multi-core machine or distributed across a compute cluster) is necessary to obtain results in a useful time frame. Furthermore, it is often the case that, as the project evolves, it becomes necessary to save intermediate results while downstream analyses are designed (or re-designed) and implemented. Under many frameworks, this makes it increasingly difficult to maintain a single point of entry for the computation. Jug is a software framework which addresses these issues by caching intermediate results and distributing the computational work as tasks across a network. Jug is written in Python without the use of compiled modules, is completely cross-platform, and available as free software under the liberal MIT license. Jug is available from: http://github.com/luispedro/jug.
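The coordination the abstract alludes to, several processes sharing the work without duplicating it, can be sketched with a standard trick: workers claim a task by atomically creating a lock file, so two workers racing on a shared filesystem can never both claim the same task. This is only an illustrative sketch of the general mechanism; the `try_lock` helper and file naming below are not Jug's actual implementation.

```python
import os
import tempfile

def try_lock(lockdir, name):
    """Atomically claim a task by creating a lock file.

    Returns True if this process got the claim. The O_EXCL flag makes
    the creation atomic: of two workers racing on the same task, exactly
    one succeeds and the other gets FileExistsError.
    """
    os.makedirs(lockdir, exist_ok=True)
    path = os.path.join(lockdir, name + '.lock')
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True

lockdir = tempfile.mkdtemp()
# Simulate three claim attempts; the repeated 't0' must be refused.
claimed = [n for n in ['t0', 't1', 't0'] if try_lock(lockdir, n)]
print(claimed)  # → ['t0', 't1']
```

The same idea extends across machines whenever the lock directory lives on a shared filesystem, which is why no central scheduler process is required.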

Highlights

  • The value of reproducible research in computational fields has been recognized in several areas, including fields as different as computational mathematics, signal processing [18, 59], neuronal network modeling [45], archeology [41], and climate science [20].

  • There is a range of medium-sized problems that can be successfully tackled on a computer cluster with a small number of nodes, or even on a single multicore machine.

Introduction

The value of reproducible research in computational fields has been recognized in several areas, including fields as different as computational mathematics, signal processing [18, 59], neuronal network modeling [45], archeology [41], and climate science [20].

When the code or its parameters change, the affected tasks and their descendants in the task DAG must be recomputed. Without tool support, this can be a very error-prone operation: by not removing all relevant intermediate files, it is easy to generate an irreproducible state where different blocks of the computation output were generated using different versions of the code.

The example starts with a simple Python function to parse the data directory structure and return a list of input files:

    from jug import TaskGenerator, CachedFunction
    import mahotas as mh
    # Note: in current scikit-learn, KFold lives in sklearn.model_selection;
    # sklearn.cross_validation was the module name at the time of writing.
    from sklearn.cross_validation import KFold

    def load():
        '''
        This function assumes that the images are stored in data/
        with filenames matching the pattern label-protein-([0-9]).tiff
        '''
        from os import listdir
        images = []
        base = './data/'
        for path in listdir(base):
            if 'protein' not in path:
                continue
            label = path.split('-')[0]
            # We only store paths and will load data on demand.
            # This saves memory.
            images.append((base + path, label))
        return images

Name: Zenodo
Persistent identifier: https://doi.org/10.5281/zenodo.847794
Licence: MIT
Publisher: Luis Pedro Coelho
Version published: 1.6.0
Date published: 24 Aug 2017
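The caching behaviour described above can be sketched in plain Python: a task's result is stored on disk under a hash of the function name and its arguments, so re-running the script skips any work whose inputs have not changed, and a change to the inputs produces a new hash, forcing recomputation. This is a deliberately simplified illustration, not Jug's actual implementation; the `Task` class, the cache layout, and the `double` function below are all invented for the sketch (Jug's real tasks also hash their dependencies and support remote backends).

```python
import hashlib
import os
import pickle
import tempfile

class Task:
    """Much-simplified sketch of a jug-style task: the result of
    f(*args) is cached on disk, keyed by a hash of the function name
    and its pickled arguments."""

    def __init__(self, f, *args):
        self.f = f
        self.args = args

    def hash(self):
        h = hashlib.sha1()
        h.update(self.f.__name__.encode())
        h.update(pickle.dumps(self.args))
        return h.hexdigest()

    def run(self, cachedir):
        path = os.path.join(cachedir, self.hash())
        if os.path.exists(path):
            # Result already computed: load it instead of recomputing.
            with open(path, 'rb') as fp:
                return pickle.load(fp)
        value = self.f(*self.args)
        with open(path, 'wb') as fp:
            pickle.dump(value, fp)
        return value

def double(x):
    return 2 * x

cachedir = tempfile.mkdtemp()
t = Task(double, 21)
print(t.run(cachedir))  # computes 42 and caches it
print(t.run(cachedir))  # loads the cached 42; double() is not called again
```

Because the key depends on the arguments, `Task(double, 22)` would hash differently and be computed fresh, which is exactly why editing a parameter invalidates a task and everything downstream of it in the DAG.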

