Container-based bioinformatics with Pachyderm.

Jon Ander Novella, Joachim Burman, Daniel Whitenack, Marco Capuccini, Kim Kultima, Stephanie Herman, Payam Emami Khoonsari, Ola Spjuth

doi:10.1093/bioinformatics/bty699


Open Access

https://doi.org/10.1093/bioinformatics/bty699


Journal: Bioinformatics
Publication Date: Aug 8, 2018
Citations: 43
License type: CC BY 4.0

Affiliation: Uppsala University

Abstract

Motivation: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline.

Results: Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, having Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability.

Availability and implementation: Pachyderm is available from https://github.com/pachyderm/pachyderm. The Pachyderm Helm Chart is available from https://github.com/kubernetes/charts/tree/master/stable/pachyderm. Pachyderm is available out-of-the-box from the PhenoMeNal VRE (https://github.com/phnmnl/KubeNow-plugin) and general Kubernetes environments instantiated via KubeNow. The code of the workflow used for the analysis is available on GitHub (https://github.com/pharmbio/LC-MS-Pachyderm).

Supplementary information: Supplementary data are available at Bioinformatics online.
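
As a rough illustration of the pipelining layer described in the abstract, the sketch below builds two chained stage definitions shaped like Pachyderm pipeline specifications and prints them as JSON. It is not taken from the paper's workflow. The field names (`pipeline`, `transform`, `parallelism_spec`, `input.pfs`) follow the general shape of the Pachyderm pipeline specification as commonly documented and may differ between Pachyderm versions. The stage names, container images, commands and the `raw-spectra` repository are hypothetical.

```python
import json


def make_pipeline_spec(name, image, cmd, input_repo, glob="/*", workers=1):
    """Return a dict shaped like a Pachyderm pipeline specification.

    Assumption: the field layout mirrors the documented spec format;
    check the reference for the Pachyderm version you actually run.
    """
    return {
        "pipeline": {"name": name},
        # The container image and command that process the data.
        "transform": {"image": image, "cmd": list(cmd)},
        # Number of parallel workers requested for this stage.
        "parallelism_spec": {"constant": workers},
        # Versioned input repository; the glob pattern splits it into units of work.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# Hypothetical two-stage workflow. The second stage reads from the
# output repository of the first, which carries the first pipeline's name.
stages = [
    make_pipeline_spec(
        name="peak-picking",
        image="example/peak-picker:1.0",  # hypothetical image
        cmd=["/bin/run.sh", "/pfs/raw-spectra", "/pfs/out"],
        input_repo="raw-spectra",  # hypothetical input repository
        workers=4,
    ),
    make_pipeline_spec(
        name="feature-linking",
        image="example/feature-linker:1.0",  # hypothetical image
        cmd=["/bin/run.sh", "/pfs/peak-picking", "/pfs/out"],
        input_repo="peak-picking",
    ),
]

for spec in stages:
    print(json.dumps(spec, indent=2))
```

Each printed document could be saved to a file and submitted to a cluster with the Pachyderm client. Chaining stages through repository names is what lets the data management layer track which input version produced which output.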

Highlights

The relevance of big data in biomedicine is evident
We demonstrate by means of a metabolomics case study how Pachyderm can enable scalable and sustainable workflows
The goal of this study was to demonstrate Pachyderm as a bioinformatics workflow system based on software containers

Summary

Introduction

The relevance of big data in biomedicine is evident. Technological advances in fields such as massively parallel sequencing (Shendure and Lieberman Aiden, 2012), mass spectrometry (Nilsson et al., 2010) and high-throughput screening (Macarron et al., 2011) are examples of how biology has shifted towards a data-intensive field (Marx, 2013). The rapid increase in the number of data points and the size of the observations in those fields poses many difficulties, but data volume is not the only obstacle. Apart from the need to process large amounts of data, computational biologists must manage analyses that.

