A general concept for consistent documentation of computational analyses.

Peter Ebert, Thomas Lengauer, Karl Nordström, Marcel H Schulz, Fabian Müller

doi:10.1093/database/bav050

Open Access

https://doi.org/10.1093/database/bav050

Abstract

The ever-growing amount of data in the field of life sciences demands standardized ways of high-throughput computational analysis. This standardization requires a thorough documentation of each step in the computational analysis to enable researchers to understand and reproduce the results. However, due to the heterogeneity in software setups and the high rate of change during tool development, reproducibility is hard to achieve. One reason is that there is no common agreement in the research community on how to document computational studies. In many cases, simple flat files or other unstructured text documents are provided by researchers as documentation, which are often missing software dependencies, versions and sufficient documentation to understand the workflow and parameter settings. As a solution we suggest a simple and modest approach for documenting and verifying computational analysis pipelines. We propose a two-part scheme that defines a computational analysis using a Process and an Analysis metadata document, which jointly describe all necessary details to reproduce the results. In this design we separate the metadata specifying the process from the metadata describing an actual analysis run, thereby reducing the effort of manual documentation to an absolute minimum. Our approach is independent of a specific software environment, results in human readable XML documents that can easily be shared with other researchers and allows an automated validation to ensure consistency of the metadata. Because our approach has been designed with little to no assumptions concerning the workflow of an analysis, we expect it to be applicable in a wide range of computational research fields.

Database URL: http://deep.mpi-inf.mpg.de/DAC/cmds/pub/pyvalid.zip
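The two-part scheme described in the abstract can be illustrated with a minimal sketch. The example below is not the authors' schema or their validation tool: every element name, attribute name, tool name and version string is hypothetical and chosen only for illustration. It assumes a Process document that declares a tool and its parameter names once, and an Analysis document that records the values used in one concrete run, and it checks that the two are consistent with each other.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical Process document: written once per pipeline.
# Element and attribute names are assumptions, not the paper's schema.
PROCESS_XML = """
<process name="align_reads" version="1.0">
  <tool name="aligner" version="0.7.10">
    <param name="threads"/>
    <param name="reference"/>
  </tool>
</process>
"""

# Hypothetical Analysis document: written once per analysis run.
ANALYSIS_XML = """
<analysis process="align_reads" process_version="1.0">
  <value tool="aligner" param="threads">8</value>
  <value tool="aligner" param="reference">hg19.fa</value>
</analysis>
"""


def validate(process_xml: str, analysis_xml: str) -> List[str]:
    """Return a list of consistency problems; an empty list means consistent."""
    process = ET.fromstring(process_xml.strip())
    analysis = ET.fromstring(analysis_xml.strip())
    problems = []

    # The run must point at the process (and version) it claims to document.
    if analysis.get("process") != process.get("name"):
        problems.append("analysis refers to a different process")
    if analysis.get("process_version") != process.get("version"):
        problems.append("process version mismatch")

    # Parameters declared by the process versus values supplied by the run.
    declared = {
        (tool.get("name"), param.get("name"))
        for tool in process.findall("tool")
        for param in tool.findall("param")
    }
    supplied = {
        (value.get("tool"), value.get("param"))
        for value in analysis.findall("value")
    }

    for tool, param in sorted(declared - supplied):
        problems.append(f"missing value for {tool}.{param}")
    for tool, param in sorted(supplied - declared):
        problems.append(f"undeclared parameter {tool}.{param}")
    return problems


if __name__ == "__main__":
    print(validate(PROCESS_XML, ANALYSIS_XML))
```

Under these assumptions, only the small Analysis document has to be written for each run, while the Process document is reused unchanged, which mirrors the stated aim of keeping manual documentation effort low.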

Highlights

Large national and international research consortia like ICGC, DEEP, Blueprint or ENCODE [1] generate and host vast amounts of genetic and epigenetic data.
The data avalanche that came with the rise of microarray and next-generation sequencing (NGS) technologies demanded the setup of high-throughput computational analysis tools and pipelines.
Several format specifications have been developed to comprehensively capture the handling of biological samples in complex studies. These formats are either tailored to specific assays, such as MAGE-TAB [4] for microarrays, or are more generally applicable, like the MAGE-TAB-based BIR-TAB specification developed by the modENCODE consortium [5].

Summary

Introduction

Large national and international research consortia like ICGC (https://icgc.org), DEEP (www.deutsches-epigenomprogramm.de), Blueprint (www.blueprint-epigenome.eu) or ENCODE [1] generate and host vast amounts of genetic and epigenetic data. The annotation metadata related to each datum ideally consist of concise descriptions of how individual files are generated. This description typically includes information on the procedures for sample acquisition, sample and donor characteristics such as health status, the type of assay and associated experimental protocols, and details on the computer programs applied to analyse the resulting data. The high rate of change in software development is due to the multitude of motivations for altering program code: fixing a bug, replacing an algorithm with a better one, changing the control flow in the program or using a more appropriate data structure, to name just a few. Despite all these reasons for changing software, good programmers aim for high stability and robustness of their software interface, e.g. the naming of command line parameters should not change with an incremental software update. We describe a concept for making metadata on computational analysis pipelines available that respects the characteristics outlined earlier.

Journal: Database: the journal of biological databases and curation
Publication Date: Jan 1, 2015
Citations: 9
License type: CC BY 4.0
