Abstract

Across many fields of science, primary data sets like sensor read-outs, time series, and genomic sequences are analyzed by complex chains of specialized tools and scripts exchanging intermediate results in domain-specific file formats. Scientific workflow management systems (SWfMSs) support the development and execution of these tool chains by providing workflow specification languages, graphical editors, fault-tolerant execution engines, etc. However, many SWfMSs are not prepared to handle large data sets because of inadequate support for distributed computing. On the other hand, most SWfMSs that do support distributed computing only allow static task execution orders. We present SAASFEE, a SWfMS which runs arbitrarily complex workflows on Hadoop YARN. Workflows are specified in Cuneiform, a functional workflow language focusing on parallelization and easy integration of existing software. Cuneiform workflows are executed on Hi-WAY, a higher-level scheduler for running workflows on YARN. Distinct features of SAASFEE are the ability to execute iterative workflows, an adaptive task scheduler, re-executable provenance traces, and compatibility to selected other workflow systems. In the demonstration, we present all components of SAASFEE using real-life workflows from the field of genomics.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call