YAMP: a containerized workflow enabling reproducibility in metagenomics research.

Alessia Visconti, Tiphaine C Martin, Mario Falchi

doi:10.1093/gigascience/giy072

Open Access

https://doi.org/10.1093/gigascience/giy072

Journal: GigaScience	Publication Date: Jun 18, 2018
Citations: 21	License type: CC BY 4.0

Affiliation: King's College London

Abstract

YAMP ("Yet Another Metagenomics Pipeline") is a user-friendly workflow that enables the analysis of whole shotgun metagenomic data while using containerization to ensure computational reproducibility and facilitate collaborative research. YAMP can be executed on any UNIX-like system and offers seamless support for multiple job schedulers as well as for the Amazon AWS cloud. Although YAMP was developed to be ready to use by nonexperts, bioinformaticians will appreciate its flexibility, modularization, and simple customization.

Highlights

Thanks to the increased cost-effectiveness of high-throughput technologies, the number of studies collecting and analyzing large amounts of data has surged, opening new challenges for data analysis and research reproducibility.
To facilitate the discussion of YAMP's computational requirements and to assess its ability to reproduce research results described in the literature, we carried out a real-world case study, which included 18 samples collected from different body sites.
Although both the simulation and the real-world case study focus on human metagenomic data, YAMP can be used to analyze data originating from virtually any environment.

Summary

Introduction

Thanks to the increased cost-effectiveness of high-throughput technologies, the number of studies collecting and analyzing large amounts of data has surged, opening new challenges for data analysis and research reproducibility. Variations across workstations and operating systems represent another obstacle [5, 6]. To overcome this issue, tools that allow the development of workflows [7] and software containers [8] have been proposed [9]. Containerized workflows facilitate collaborative projects by ensuring identical analysis processes and comparable results, and they allow the automation of data-intensive repetitive tasks [11]. They save users with little bioinformatics or computational expertise from the hassle of installing the required software and of designing and implementing often complex analysis orchestrations, while expert bioinformaticians can use them as a starting point for customized analyses, avoiding redundant solutions.

Methods

Results

Conclusion