Streamlining data-intensive biology with workflow systems.

Taylor Reiter, Shannon E K Joslin, N Tessa Pierce-Ward, Phillip T Brooks, Camille Scott, Luiz Irber, C Titus Brown, Charles M Reid

doi:10.1093/gigascience/giaa140

Open Access

https://doi.org/10.1093/gigascience/giaa140

Journal: GigaScience	Publication Date: Jan 13, 2021
Citations: 33	License type: CC BY 4.0

Affiliation: University of California, Davis

Abstract

As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.

Highlights

We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training, and analysis projects.
Building upon the rich literature of “best” and “good enough” practices for computational biology [8,9,10], we present a series of strategies and practices for adopting workflow systems to streamline data-intensive biology research.
This manuscript is designed to help guide biologists towards project, data, and resource management strategies that facilitate and expedite reproducible data analysis in their research. We present these strategies in the context of our own experiences working with high-throughput sequencing data, but many are broadly applicable to biologists working beyond this field.

Summary

Author Summary

We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training, and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that benefit their own research and facilitate open and reproducible science.
