Doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows

Daniel Svensson, Rickard Sjögren, David Sundell, Andreas Sjödin, Johan Trygg

doi:10.1186/s12859-019-3091-z


Open Access


Abstract

Background: Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only will each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both of these effects may be difficult to predict. To make the situation even more complex, multiple tools may be run in a sequential pipeline where the final output depends on the parameter configuration for each tool in the pipeline. Because of the complexity and difficulty of predicting outcomes, in practice parameters are often left at default settings or set based on personal or peer experience obtained in a trial-and-error fashion. To allow for the reliable and efficient selection of parameters for bioinformatic pipelines, a systematic approach is needed.

Results: We present doepipeline, a novel approach to optimizing bioinformatic software parameters, based on core concepts of the Design of Experiments methodology and recent advances in subset designs. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, then optimized in the subsequent phase using response surface designs and ordinary least squares (OLS) modeling. Doepipeline was used to optimize parameters in four use cases: 1) de-novo assembly, 2) scaffolding of a fragmented genome assembly, 3) k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and 4) genetic variant calling. In all four cases, doepipeline found parameter settings that produced a better outcome with respect to the characteristic measured when compared to using default values. Our approach is implemented and available in the Python package doepipeline.

Conclusions: Our proposed methodology provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking. Implementation in doepipeline makes our methodology accessible and user-friendly, and allows for automatic optimization of tools in a wide range of cases. The source code of doepipeline is available at https://github.com/clicumu/doepipeline and it can be installed through conda-forge.

Highlights

Selecting the proper parameter settings for bioinformatic software tools is challenging
We propose an approach for the optimization of software parameters, based on methods derived from statistical design of experiments
Sequence reads from less complex segments of the genome will map to more than one position, causing ambiguities that are not possible to resolve, and this in turn leads to fragmentation of the assembly

Summary

Introduction

Selecting the proper parameter settings for bioinformatic software tools is challenging. Parameter settings are typically chosen from personal or peer experience gained in a trial-and-error fashion, or simply left at their default values. This kind of non-systematic selection runs the risk of producing sub-optimal results. Design of Experiments (DoE) aims to maximize information gain while minimizing the number of experiments required [5]. This is done by introducing variation into the system under investigation in a structured manner in order to explain how the parameters (factors) influence the result (response). Generalized subset designs (GSDs) reduce the number of runs required to explore an equivalent parameter space by an integer factor, called the reduction factor.
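To make the effect of a reduction factor concrete, the sketch below counts runs for three factors at three levels each, first as a full factorial and then as a reduced subset. The subset rule used here (keep runs whose level indices sum to a multiple of the reduction factor) is a simple stand-in chosen for illustration; it is not the GSD construction itself and not how doepipeline builds its designs. The function names `full_factorial` and `modular_fraction` are hypothetical.

```python
from itertools import product


def full_factorial(levels):
    """Return every combination of level indices, one tuple per run."""
    return list(product(*(range(n) for n in levels)))


def modular_fraction(levels, reduction):
    """Illustrative stand-in for a subset design (NOT the GSD algorithm).

    Keeps only the runs whose level indices sum to a multiple of
    `reduction`, which thins the full factorial in a structured way.
    """
    return [run for run in full_factorial(levels) if sum(run) % reduction == 0]


# Three factors (parameters), each investigated at three levels.
levels = [3, 3, 3]
full = full_factorial(levels)
subset = modular_fraction(levels, reduction=3)

print(len(full), len(subset))  # 27 9
```

In this small case the subset needs one third of the runs, and every level of every factor still appears equally often, so the whole range of each parameter remains covered.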



Journal: BMC Bioinformatics
Publication Date: Oct 15, 2019
Citations: 2
License type: open-access
