BigDataScript: a scripting language for data pipelines.

Pablo Cingolani, Mathieu Blanchette, Rob Sladek

doi:10.1093/bioinformatics/btu595

Open Access

https://doi.org/10.1093/bioinformatics/btu595

Abstract

Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax to develop and manage pipeline execution and provide robustness to various types of software and hardware failures as well as portability.

Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.

Availability and implementation: BigDataScript is available under open-source license at http://pcingola.github.io/BigDataScript.

Contact: pablo.e.cingolani@gmail.com
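To make the idea of lazy processing concrete, the following is a minimal Python sketch, not BDS code and not the authors' implementation. It assumes one common reading of the term: a step is skipped when its output already exists, is non-empty and is newer than all of its inputs, so a pipeline restarted after a failure only redoes the missing work. The names `needs_update` and `run_step` are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable


def needs_update(output: Path, inputs: Iterable[Path]) -> bool:
    """Return True when `output` has to be (re)built from `inputs`.

    Assumption for this sketch: an output is stale if it is missing,
    empty, or older than at least one of its inputs.
    """
    if not output.exists() or output.stat().st_size == 0:
        return True
    out_mtime = output.stat().st_mtime
    return any(p.stat().st_mtime > out_mtime for p in inputs)


def run_step(output: Path, inputs: Iterable[Path], action: Callable[[], None]) -> bool:
    """Run `action` only if `output` is stale; report whether it ran."""
    if not needs_update(output, list(inputs)):
        return False  # output is up to date, so the step is skipped
    action()
    return True
```

Calling `run_step` for every step on each start of the pipeline means that a rerun after a crash repeats only the steps whose outputs are missing or out of date.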

Highlights

Processing large amounts of data is becoming increasingly important and common in research environments as a consequence of technology improvements and reduced costs of high-throughput experiments.
Many of the software tools used in pipelines that solve big data genomics problems are CPU, memory or I/O intensive and commonly run for several hours or even days.
A processing pipeline for a sequencing-based genome-wide association study may involve the following steps (Auwera et al., 2013): (i) mapping DNA sequence reads obtained from thousands of patients to a reference genome; (ii) identifying genetic changes present in each patient genome; (iii) annotating these variants with respect to known gene transcripts or other genome landmarks; (iv) applying statistical analyses to identify genetic variants that are associated with differences in the patient phenotypes; and (v) quality control on each of the previous steps.

Summary

Introduction

Processing large amounts of data is becoming increasingly important and common in research environments as a consequence of technology improvements and reduced costs of high-throughput experiments. With the democratization of high-throughput approaches and simplified access to processing resources (e.g. cloud computing), researchers must routinely analyze large datasets. This paradigm shift with respect to the access and manipulation of information creates new challenges by requiring highly specialized skills, such as implementing data-processing pipelines, to be accessible to a much wider audience. A processing pipeline designed for a multi-core server cannot be used directly on a cluster, because running tasks on a cluster requires queuing them using cluster-specific commands (e.g. qsub). When the pipeline language does not hide such differences, programmers and researchers must spend significant effort dealing with architecture-specific details that are not germane to the problem of interest, and pipelines have to be reprogrammed or adapted to run on other computer architectures. This is aggravated by the fact that requirements change often and the software tools are constantly evolving.
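The portability problem described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not how BDS is implemented: the function `launch_argv` and the system names `"local"` and `"cluster"` are hypothetical, and the cluster case assumes a scheduler whose `qsub` command accepts a job script path as its argument. The point is that the step itself (a shell script) stays the same, while only the launch command depends on the architecture.

```python
from typing import List


def launch_argv(script_path: str, system: str) -> List[str]:
    """Build the argument vector that would start `script_path` on `system`.

    Hypothetical helper: "local" runs the script with sh, "cluster"
    hands it to the scheduler by passing the script path to qsub.
    """
    if system == "local":
        return ["sh", script_path]
    if system == "cluster":
        return ["qsub", script_path]
    raise ValueError(f"unknown system type: {system!r}")


# The same pipeline step, launched two different ways.
for target in ("local", "cluster"):
    print(target, launch_argv("map_reads.sh", target))
```

Keeping the choice of system in a single argument, rather than in the body of every step, is what lets the same pipeline description move between a laptop and a cluster without being rewritten.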

Journal: Bioinformatics (Oxford, England)
Publication Date: Sep 3, 2014
Citations: 31
License type: CC BY 4.0
