Revisiting workflow execution in HPC: a data-flow approach
- Research Article
16
- 10.1002/cpe.3616
- Aug 4, 2015
- Concurrency and Computation: Practice and Experience
Summary: Computer simulations may ingest and generate large numbers of raw data files. Most of these files follow a de facto standard format established by the application domain, for example, Flexible Image Transport System (FITS) for astronomy. Although these formats are supported by a variety of programming languages, libraries, and programs, analyzing thousands or millions of files requires developing specific programs. Database management systems (DBMS) are not suited for this, because they require loading the raw data and structuring it, which becomes costly at large scale. Systems like NoDB, RAW, and FastBit have been proposed to index and query raw data files without the overhead of using a DBMS. However, these solutions focus on analyzing one single large file rather than several related files. When related files are produced and required for analysis, the relationships among elements within the file contents must be managed manually, with specific programs to access the raw data. This data management may therefore be time-consuming and error-prone. When computer simulations are managed by a scientific workflow management system (SWfMS), they can take advantage of provenance data to relate and analyze raw data files produced during workflow execution. However, SWfMSs register provenance at a coarse grain, with limited analysis of elements from raw data files. When the SWfMS is dataflow-aware, it can register provenance data and the relationships among elements of raw data files together in a database, which is useful for accessing the contents of a large number of files. In this paper, we propose a dataflow approach for analyzing element data from several related raw data files. Our approach is complementary to the existing single raw data file analysis approaches. We use the Montage workflow from astronomy and a workflow from the oil and gas domain as data-intensive case studies.
Our experimental results for the Montage workflow explore different types of raw data flows, such as showing all linear transformations involved in projection simulation programs while considering specific mosaic elements from input repositories. The cost of raw data extraction is approximately 3.7% of the total application execution time. Copyright © 2015 John Wiley & Sons, Ltd.
- Conference Article
5
- 10.1109/sbac-padw.2014.32
- Oct 1, 2014
Scientific applications generate raw data files at very large scale. Most of these files follow a standard format established by the application domain, such as HDF5, NetCDF, and FITS. These formats are supported by a variety of programming languages, libraries, and programs. Because the files are so numerous, analyzing them requires writing a specific program. Generic data analysis systems such as database management systems (DBMS) are not well suited, because of the cost of data loading and data transformation at large scale. Recently there have been several proposals for indexing and querying raw data files without the overhead of using a DBMS, such as NoDB, RAW, and FastBit. Their goal is to offer query support for a raw data file after a scientific program has generated it. However, these solutions focus on the analysis of one single large file. When a large number of files are all related and required for the evaluation of one scientific hypothesis, the relationships must be managed manually or by writing specific programs. The proposed approach takes advantage of existing provenance data support from Scientific Workflow Management Systems (SWfMS). When scientific applications are managed by an SWfMS, the data is registered in the provenance database at runtime. This provenance data may therefore act as a description of these files. When the SWfMS is dataflow-aware, it registers domain data in the same database. The resulting database becomes an important access method to the large number of files generated by the scientific workflow execution, and it complements single raw data file analysis support. In this work, we present our dataflow approach for analyzing data from several raw data files and evaluate it with the Montage application from the astronomy domain.
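To make the idea of one database describing many raw files concrete, the following is a minimal sketch and not the authors' implementation. It assumes a relational store (SQLite here) with two hypothetical tables, `raw_file` and `element`, and assumes that element values such as FITS header keywords have already been extracted from each file by some other step. The function names, table layout, and sample values are all illustrative assumptions.

```python
import sqlite3


def build_provenance_db(records):
    """Load (activity, file_path, {element_name: value}) records into SQLite.

    The schema is a hypothetical simplification: one row per raw file,
    tagged with the workflow activity that produced it, plus one row per
    element extracted from that file.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE raw_file (
            file_id  INTEGER PRIMARY KEY,
            activity TEXT NOT NULL,
            path     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE element (
            file_id INTEGER NOT NULL REFERENCES raw_file(file_id),
            name    TEXT NOT NULL,
            value   REAL NOT NULL
        );
        """
    )
    for activity, path, elements in records:
        cur = conn.execute(
            "INSERT INTO raw_file (activity, path) VALUES (?, ?)",
            (activity, path),
        )
        conn.executemany(
            "INSERT INTO element (file_id, name, value) VALUES (?, ?, ?)",
            [(cur.lastrowid, name, value) for name, value in elements.items()],
        )
    conn.commit()
    return conn


def files_where(conn, name, low, high):
    """Return (activity, path, value) for every file whose element `name`
    lies in [low, high], without opening any raw file."""
    rows = conn.execute(
        """
        SELECT f.activity, f.path, e.value
        FROM element AS e
        JOIN raw_file AS f ON f.file_id = e.file_id
        WHERE e.name = ? AND e.value BETWEEN ? AND ?
        ORDER BY f.path
        """,
        (name, low, high),
    )
    return rows.fetchall()


# Illustrative records only: the paths and numeric values are made up.
SAMPLE_RECORDS = [
    ("mProjectPP", "proj_001.fits", {"CRVAL1": 210.5, "CRVAL2": -5.1}),
    ("mProjectPP", "proj_002.fits", {"CRVAL1": 215.0, "CRVAL2": -5.0}),
    ("mAdd", "mosaic.fits", {"CRVAL1": 210.9, "CRVAL2": -5.2}),
]

if __name__ == "__main__":
    db = build_provenance_db(SAMPLE_RECORDS)
    for row in files_where(db, "CRVAL1", 210.0, 211.0):
        print(row)
```

A single query then spans files produced by different activities, which is the kind of multi-file analysis that single-file query systems do not address directly. A real dataflow-aware SWfMS would populate such tables at runtime rather than from an in-memory list.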