Abstract

The ever-growing amount of data in the field of life sciences demands standardized ways of high-throughput computational analysis. This standardization requires thorough documentation of each step in the computational analysis to enable researchers to understand and reproduce the results. However, due to the heterogeneity in software setups and the high rate of change during tool development, reproducibility is hard to achieve. One reason is that there is no common agreement in the research community on how to document computational studies. In many cases, researchers provide simple flat files or other unstructured text documents as documentation, which often lack software dependencies, versions and sufficient detail to understand the workflow and parameter settings. As a solution we suggest a simple and modest approach for documenting and verifying computational analysis pipelines. We propose a two-part scheme that defines a computational analysis using a Process and an Analysis metadata document, which jointly describe all necessary details to reproduce the results. In this design we separate the metadata specifying the process from the metadata describing an actual analysis run, thereby reducing the effort of manual documentation to an absolute minimum. Our approach is independent of a specific software environment, results in human-readable XML documents that can easily be shared with other researchers and allows an automated validation to ensure consistency of the metadata. Because our approach has been designed with few to no assumptions concerning the workflow of an analysis, we expect it to be applicable in a wide range of computational research fields.

Database URL: http://deep.mpi-inf.mpg.de/DAC/cmds/pub/pyvalid.zip

Highlights

  • Large national and international research consortia like ICGC, DEEP, Blueprint or ENCODE [1] generate and host vast amounts of genetic and epigenetic data

  • The data avalanche that came with the rise of microarray and next-generation sequencing (NGS) technologies demanded the setup of high-throughput computational analysis tools and pipelines

  • Several format specifications have been developed to comprehensively capture the handling of biological samples in complex studies. These formats are either tailored to specific assays, such as MAGE-TAB [4] for microarrays, or are more generally applicable, like the MAGE-TAB-based BIR-TAB specification developed by the modENCODE consortium [5]

Summary

Introduction

Large national and international research consortia like ICGC (https://icgc.org), DEEP (www.deutsches-epigenomprogramm.de), Blueprint (www.blueprint-epigenome.eu) or ENCODE [1] generate and host vast amounts of genetic and epigenetic data. The annotation metadata related to each datum ideally consist of concise descriptions of how individual files are generated. This description typically includes information on the procedures for sample acquisition, sample and donor characteristics such as health status, the type of assay and associated experimental protocols and details on the computer programs applied to analyse the resulting data. The high rate of change in software development is due to the multitude of motivations for altering program code: fixing a bug, replacing an algorithm with a better one, changing the control flow in the program or using a more appropriate data structure, to name just a few. Despite all these reasons for changing software, good programmers aim for high stability and robustness of their software interface, e.g. the naming of command line parameters should not change with an incremental software update. We describe a concept for making metadata on computational analysis pipelines available that respects the characteristics outlined earlier.
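
To make the two-part scheme from the abstract more concrete, the following is a minimal sketch of how a Process document and an Analysis document could be checked against each other. All element and attribute names (process, tool, parameter, analysis, value), the tool name and the parameter names are hypothetical and chosen for illustration only; they are not taken from the schema distributed at the Database URL, and the check shown is only one possible form of automated consistency validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical Process document: written once, describes the pipeline step,
# the tool version and the parameters whose values must be recorded per run.
PROCESS_XML = """
<process name="align_reads" version="1.0">
  <tool name="example_aligner" version="1.2.0"/>
  <parameter name="threads"/>
  <parameter name="reference"/>
</process>
"""

# Hypothetical Analysis document: written per run, refers to a Process
# and supplies concrete values for the parameters that Process declares.
ANALYSIS_XML = """
<analysis process="align_reads" process_version="1.0">
  <value parameter="threads">8</value>
  <value parameter="reference">genome.fa</value>
</analysis>
"""


def validate(process_xml, analysis_xml):
    """Return a list of inconsistency messages (empty if the pair is consistent)."""
    process = ET.fromstring(process_xml.strip())
    analysis = ET.fromstring(analysis_xml.strip())
    problems = []

    # The Analysis must point at exactly this Process name and version.
    if analysis.get("process") != process.get("name"):
        problems.append("analysis refers to a different process")
    if analysis.get("process_version") != process.get("version"):
        problems.append("analysis refers to a different process version")

    # Every declared parameter needs a value; undeclared values are rejected.
    declared = {p.get("name") for p in process.findall("parameter")}
    supplied = {v.get("parameter") for v in analysis.findall("value")}
    for name in sorted(declared - supplied):
        problems.append(f"missing value for parameter '{name}'")
    for name in sorted(supplied - declared):
        problems.append(f"undeclared parameter '{name}'")
    return problems


if __name__ == "__main__":
    for message in validate(PROCESS_XML, ANALYSIS_XML) or ["consistent"]:
        print(message)
```

Keeping the parameter declarations in the Process document means that an Analysis document only has to list values, which reflects the stated goal of keeping per-run manual documentation small.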

