Abstract

Reproducibility is essential to open science, as there is limited relevance for findings that cannot be reproduced by independent research groups, regardless of their validity. It is therefore crucial for scientists to describe their experiments in sufficient detail so they can be reproduced, scrutinized, challenged, and built upon. However, the intrinsic complexity and continuous growth of biomedical data make it increasingly difficult to process, analyze, and share these data with the community in a FAIR (findable, accessible, interoperable, and reusable) manner. To overcome these issues, we created a cloud-based platform called ORCESTRA (orcestra.ca), which provides a flexible framework for the reproducible processing of multimodal biomedical data. It enables processing of clinical, genomic, and perturbation profiles of cancer samples through automated, user-customizable processing pipelines. ORCESTRA creates integrated and fully documented data objects with persistent identifiers (DOIs) and manages multiple dataset versions, which can be shared for future studies.

Highlights

  • These rich preclinical data are often combined with clinical genomics data generated over the past decades[22], with the aim of testing whether preclinical biomarkers can be translated to clinical settings to improve patient care

  • We opted for Pachyderm, an open-source orchestration tool for multi-stage, language-agnostic data-processing pipelines that maintains complete reproducibility and provenance through the use of Kubernetes. Among the functionalities it provides is programming-language support: Pachyderm supports creating and deploying language-agnostic pipelines across on-premises or cloud infrastructures, a feature supported by DNAnexus, Databricks, and Lifebit

Summary

Introduction

The demand for large volumes of multimodal biomedical data has grown drastically, partially due to active research in personalized medicine and efforts to further understand diseases[1,2,3]. This shift has made reproducing research findings much more challenging because of the need to ensure the use of adequate data-handling methods, and as a result the validity and relevance of studies have been questioned[4,5]. A prevalent example is the use of a single pipeline for data processing, with no documentation justifying the choice of pipeline; this affects the released dataset, which is often available in only a single version.
