Analyzing related raw data files through dataflows
Summary: Computer simulations may ingest and generate large numbers of raw data files. Most of these files follow a de facto standard format established by the application domain, for example, the Flexible Image Transport System (FITS) for astronomy. Although these formats are supported by a variety of programming languages, libraries, and programs, analyzing thousands or millions of files requires developing specific programs. Database management systems (DBMS) are not suited for this, because they require loading and structuring the raw data, which becomes heavy at large scale. Systems like NoDB, RAW, and FastBit have been proposed to index and query raw data files without the overhead of a DBMS. However, these solutions focus on analyzing one single large file rather than several related files. When related files are produced and required for analysis, the relationships among elements within file contents must be managed manually, with specific programs to access the raw data; such data management may be time-consuming and error-prone. When computer simulations are managed by a scientific workflow management system (SWfMS), they can take advantage of provenance data to relate and analyze raw data files produced during workflow execution. However, SWfMSs register provenance at a coarse grain, with limited analysis of elements from raw data files. When the SWfMS is dataflow-aware, it can register provenance data and the relationships among elements of raw data files together in a database, which is useful for accessing the contents of a large number of files. In this paper, we propose a dataflow approach for analyzing element data from several related raw data files. Our approach is complementary to existing single-file raw data analysis approaches. We use the Montage workflow from astronomy and a workflow from the oil and gas domain as data-intensive case studies. Our experimental results for the Montage workflow explore different types of raw dataflows, such as showing all linear transformations involved in projection simulation programs, considering specific mosaic elements from input repositories. The cost of raw data extraction is approximately 3.7% of the total application execution time. Copyright © 2015 John Wiley & Sons, Ltd.
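As an illustrative aside (not taken from the paper): once a dataflow-aware SWfMS has registered file relationships and extracted elements in a database, cross-file analyses reduce to queries. The sketch below assumes a hypothetical SQLite provenance schema (tables `task`, `raw_file`, `element`); the table and column names are invented for illustration, while the CD keywords are the standard FITS WCS linear-transformation terms referenced by the Montage use case.

```python
# Hedged sketch: querying element data across related raw data files through a
# provenance database. The schema (task, raw_file, element) is hypothetical;
# only the FITS CD matrix keywords (linear-transformation terms) are standard.
import sqlite3

conn = sqlite3.connect("provenance.db")  # assumed dataflow-aware SWfMS database

query = """
SELECT t.task_id,
       fin.path  AS input_fits,
       fout.path AS output_fits,
       e.name    AS element,
       e.value   AS value
FROM task t
JOIN raw_file fin  ON fin.file_id  = t.input_file_id
JOIN raw_file fout ON fout.file_id = t.output_file_id
JOIN element e     ON e.file_id    = fout.file_id
WHERE t.program = 'mProjectPP'                         -- Montage reprojection
  AND e.name IN ('CD1_1', 'CD1_2', 'CD2_1', 'CD2_2')   -- linear transform terms
ORDER BY t.task_id
"""
for row in conn.execute(query):
    print(row)
conn.close()
```

A query of this shape replaces the per-file parsing programs the abstract mentions: the relationships among files are resolved by joins rather than by hand-written file traversal.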
- Research Article
21
- 10.1016/j.future.2017.01.016
- Jan 11, 2017
- Future Generation Computer Systems
Raw data queries during data-intensive parallel workflow execution
- Conference Article
5
- 10.1109/sbac-padw.2014.32
- Oct 1, 2014
Scientific applications generate raw data files at very large scale. Most of these files follow a standard format established by the application domain, such as HDF5, NetCDF, and FITS. These formats are supported by a variety of programming languages, libraries, and programs, but analyzing files at this scale requires writing specific programs. Generic data analysis systems like database management systems (DBMS) are not suited because of the cost of data loading and data transformation at large scale. Recently there have been several proposals for indexing and querying raw data files without the overhead of using a DBMS, such as NoDB, RAW, and FastBit. Their goal is to offer query support over a raw data file after a scientific program has generated it. However, these solutions focus on the analysis of one single large file. When a large number of files are related and all required for the evaluation of one scientific hypothesis, the relationships must be managed manually or by writing specific programs. The proposed approach takes advantage of existing provenance data support from Scientific Workflow Management Systems (SWfMS). When scientific applications are managed by an SWfMS, the data is registered in the provenance database at runtime, so this provenance data may act as a description of these files. When the SWfMS is dataflow-aware, it registers domain data all in the same database. The resulting database becomes an important access method to the large number of files generated by the scientific workflow execution, complementing single-file raw data analysis support. In this work, we present our dataflow approach for analyzing data from several raw data files and evaluate it with the Montage application from the astronomy domain.
- Research Article
- 10.5075/epfl-thesis-6644
- Jan 1, 2015
Nowadays, business and scientific applications accumulate data at an increasing pace. This growth of information has already started to outgrow the capabilities of database management systems (DBMS). In a typical DBMS usage scenario, the user must define a schema, load the data, and tune the system for an expected workload before submitting any queries. Copying data into a database is a significant investment in terms of time and resources, and in many cases unnecessary or even no longer feasible in practice due to the explosive data growth. Additionally, the way a DBMS stores and organizes data during data loading defines how data will be accessed for a given workload and thus the maximum performance. Selecting the underlying data layout (row-store or column-store) is a critical first tuning decision which cannot be changed later. Nevertheless, query workloads today are not static; they evolve as queries change. Hence, static design decisions can be suboptimal. In this thesis, we advocate in situ query processing as the principal way to manage data in a database. We reconsider the data loading phase and redesign traditional query processing architectures to work efficiently over raw data files, addressing the heavy initialization cost that comes with data loading. We present adaptive data loading as an alternative to traditional full a priori data loading. We explore the potential of in situ query processing in the context of current DBMS architectures. We identify performance bottlenecks specific to in situ processing and introduce an adaptive indexing mechanism (the positional map) that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure and techniques for collecting statistics over raw data files. Moreover, we design a flexible query engine that is not built around a single storage layout but can exploit different storage layouts and data execution strategies in a single engine. It decides during query processing which design fits the input queries and properly adapts the underlying data storage. By applying code generation techniques, we dynamically generate access operators tailored for specific classes of queries. This thesis revises the traditional paradigm of loading, tuning, and then querying by using in situ query processing as the principal way to minimize data-to-query time. We show that raw data files should not be considered "outside" the DBMS and that full data loading should not be a requirement to exploit database technology. On the contrary, proper techniques specifically tailored to overcome the limitations of accessing raw data files can eliminate the data loading overhead, thereby making raw data files first-class citizens, fully integrated with the query engine. The proposed roadmap can provide guidance on how to convert any traditional DBMS into an efficient in situ query engine.
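To make the positional-map idea concrete, here is a minimal sketch, assuming a plain comma-separated raw file: the first point query scans the file once and records the byte offset of every field, so subsequent queries seek directly to the requested value instead of re-tokenizing lines. The mechanism described in the thesis is adaptive and far more sophisticated; this only illustrates the principle.

```python
# Minimal sketch of a positional map for in situ raw-file access (illustrative
# simplification, not the thesis implementation): cache the byte offset of
# each CSV field during a first scan so later point queries can seek() there.
from typing import Dict, Tuple

class PositionalMap:
    def __init__(self, path: str):
        self.path = path
        self.offsets: Dict[Tuple[int, int], int] = {}  # (row, col) -> offset

    def get_field(self, row: int, col: int) -> str:
        with open(self.path, "rb") as f:
            if not self.offsets:                  # first access: build the map
                pos = 0
                for r, line in enumerate(f):      # single sequential scan
                    off = pos
                    for c, field in enumerate(line.rstrip(b"\n").split(b",")):
                        self.offsets[(r, c)] = off
                        off += len(field) + 1     # +1 for the delimiter
                    pos += len(line)
            f.seek(self.offsets[(row, col)])      # later accesses: direct seek
            return f.readline().split(b",")[0].decode().rstrip()

# Usage: a long-lived PositionalMap("data.csv") parses the file once and then
# answers point queries such as .get_field(10, 2) with a single seek each.
```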
- Conference Article
30
- 10.1145/1646468.1646470
- Nov 16, 2009
One of the main advantages of using a scientific workflow management system (SWfMS) to orchestrate data flows among scientific activities is to control and register the whole workflow execution. Executing activities within a workflow on high performance computing (HPC) resources presents challenges for SWfMS execution control. Current solutions leave the scheduling to the HPC queue system. Since the workflow execution engine does not run on the remote clusters, the SWfMS is not aware of the parallel strategy of the workflow execution. Consequently, remote execution control and provenance registration of the parallel activities is very limited from the SWfMS side. This work presents a set of components to be included in the workflow specification of any SWfMS to control parallelization of activities as many-task computing (MTC). In addition, these components can gather provenance data during remote workflow execution. Through these MTC components, the parallelization strategy can be registered and reused, and provenance data can be uniformly queried. We have evaluated our approach by performing parameter sweep parallelization in solving the incompressible 3D Navier-Stokes equations. Experimental results show performance gains with the additional benefit of distributed provenance support.
- Research Article
19
- 10.1002/mp.12128
- Mar 14, 2017
- Medical Physics
Lung cancer screening with low-dose CT has recently been approved for reimbursement, heralding the arrival of such screening services worldwide. Computer-aided detection (CAD) tools offer the potential to assist radiologists in detecting nodules in these screening exams. In lung screening, as in all CT exams, there is interest in further reducing radiation dose. However, the effects of continued dose reduction on CAD performance are not fully understood. In this work, we investigated the effect of reducing radiation dose on CAD lung nodule detection performance in a screening population. The raw projection data files were collected from 481 patients who underwent low-dose screening CT exams at our institution as part of the National Lung Screening Trial (NLST). All scans were performed on a multidetector scanner (Sensation 64, Siemens Healthcare, Forchheim Germany) according to the NLST protocol, which called for a fixed tube current scan of 25 effective mAs for standard-sized patients and 40 effective mAs for larger patients. The raw projection data were input to a reduced-dose simulation software to create simulated reduced-dose scans corresponding to 50% and 25% of the original protocols. All raw data files were reconstructed at the scanner with 1 mm slice thickness and B50 kernel. The lungs were segmented semi-automatically, and all images and segmentations were input to an in-house CAD algorithm trained on higher dose scans (75-300 mAs). CAD findings were compared to a reference standard generated by an experienced reader. Nodule- and patient-level sensitivities were calculated along with false positives per scan, all of which were evaluated in terms of the relative change with respect to dose. Nodules were subdivided based on size and solidity into categories analogous to the LungRADS assessment categories, and sub-analyses were performed. From the 481 patients in this study, 82 had at least one nodule (prevalence of 17%) and 399 did not (83%). A total of 118 nodules were identified. Twenty-seven nodules (23%) corresponded to LungRADS category 4 based on size and composition, while 18 (15%) corresponded to LungRADS category 3 and 73 (61%) corresponded to LungRADS category 2. For solid nodules ≥8 mm, patient-level median sensitivities were 100% at all three dose levels, and mean sensitivities were 72%, 63%, and 63% at original, 50%, and 25% dose, respectively. Overall mean patient-level sensitivities for nodules ranging from 3 to 45 mm were 38%, 37%, and 38% at original, 50%, and 25% dose due to the prevalence of smaller nodules and nonsolid nodules in our reference standard. The mean false-positive rates were 3, 5, and 13 per case. CAD sensitivity decreased very slightly for larger nodules as dose was reduced, indicating that reducing the dose to 50% of original levels may be investigated further for use in CT screening. However, the effect of dose was small relative to the effect of the nodule size and solidity characteristics. The number of false positives per scan increased substantially at 25% dose, illustrating the importance of tuning CAD algorithms to very challenging, high-noise screening exams.
- Book Chapter
7
- 10.1007/978-3-642-17819-1_28
- Jan 1, 2010
Scientific experiments present several advantages when modeled at high abstraction levels, independent from Scientific Workflow Management System (SWfMS) specification languages. For example, the scientist can define the scientific hypothesis in terms of algorithms and methods. Then, this high-level experiment can be mapped into different scientific workflow instances. These instances can be executed by an SWfMS and take advantage of its provenance records. However, each workflow execution is often treated by the SWfMS as an independent instance. There are no tools that allow modeling the conceptual experiment and linking it to the diverse workflow execution instances. This work presents GExpLine, a tool for supporting experiment composition through provenance. In an analogy to software development, it can be seen as a CASE tool, while an SWfMS can be seen as an IDE. It provides a conceptual representation of the scientific experiment and automatically associates workflow executions with the concept of the experiment. By using prospective provenance from the experiment, GExpLine generates corresponding workflows that can be executed by an SWfMS. This paper also presents a real experiment use case that reinforces the importance of GExpLine and its prospective provenance support.
- Research Article
13
- 10.1007/s10586-019-02920-6
- Mar 9, 2019
- Cluster Computing
Scientific workflows are abstractions composed of activities, data, and dependencies that model a computer simulation and are managed by complex engines named scientific workflow management systems (SWfMS). Many workflows demand substantial computational resources, since their executions may involve a number of different programs processing a massive volume of data. Thus, the use of high-performance computing (HPC) and data-intensive scalable computing environments, allied to parallelization techniques, provides the necessary support for the execution of such workflows. Clouds are environments that already offer HPC capabilities, and workflows can exploit them. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment. Thus, existing SWfMS must be fault-tolerant. There are several types of fault tolerance techniques used in SWfMS, such as checkpoint/restart, re-execution, and over-provisioning, but it is far from trivial to choose a fault tolerance technique that will not jeopardize the parallel execution. The major problem is that the suitable technique may be different for each workflow, activity, or activation, since the programs associated with activities may present different behaviors. This article analyzes several fault-tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user's workflow activities and activations using machine learning techniques and provenance data, aiming at improving resiliency.
- Conference Article
1
- 10.1109/naecon.1995.522026
- May 22, 1995
Joint Modeling and Simulation System (J-MASS) was specified to possess an open-systems-based architecture to support Department of Defense modeling and simulation needs well into the future. Its open systems architectural design is based on a backplane-and-agents concept. One of the most important agents is the Modeling Library, which provides a repository for user-developed model components, configuration data, simulations, scenario files, output data, and postprocessing results. The Modeling Library will also store modeling and simulation tools, related data files, and J-MASS system source code. What technology will enable the J-MASS Modeling Library to assist users in organizing their data and the program office in establishing a Test Process Archive for its systems? During the past few years, several studies have been performed to review the fast-changing area of data management technology. These studies have looked at the mature technology of relational database management systems (RDBMSs) and the emerging technologies of object-oriented database management systems (OODBMSs) and object servers. This paper will provide an overview of this work. The paper will first identify the J-MASS requirements and then proceed with a review of the various technologies and the evaluations that were performed against representative implementations of the emerging technologies. The paper will conclude with the technology recommendation for J-MASS data management.
- Conference Article
- 10.1109/sisy.2016.7601477
- Aug 1, 2016
With the increasing capacity and power of distributed computing infrastructures, in silico experiments have gained widespread popularity. Different scientific communities (physics, earthquake science, biology, etc.) have developed their own Scientific Workflow Management Systems (SWfMS) to provide dynamic execution to scientists. These SWfMSs differ from each other due to divergent requirements. In spite of their diversity, however, they agree on the strongest need that has not yet been completely fulfilled: runtime user steering and adaptive dynamic execution. Additionally, when provenance data is collected during execution, provenance-based steering also emerges as a big challenge. To support scientists with a special interaction mechanism during runtime, we have introduced so-called iPoints, special intervention points where the scientist or the system can take over control and manipulate workflow execution based on provenance and intermediary data. In our current work we specified these iPoints in the IWIR language, which was designed to provide interoperability among four existing well-known SWfMSs within the framework of the SHIWA project.
- Research Article
3
- 10.1145/3457145
- May 27, 2021
- Proceedings of the ACM on Human-Computer Interaction
To process a large amount of data sequentially and systematically, proper management of workflow components (i.e., modules, data, configurations, associations among ports and links) in a Scientific Workflow Management System (SWfMS) is indispensable. Managing data with provenance in an SWfMS to support reusability of workflows, modules, and data is not a simple task. Handling such components is even more burdensome for frequently assembled and executed complex workflows that investigate large datasets with different technologies (i.e., various learning algorithms or models). A great many studies propose techniques and technologies for managing and recommending services in an SWfMS, but only very few consider the management of data in an SWfMS for efficient storage and for facilitating workflow executions. Furthermore, no study has inquired into the effectiveness and efficiency of such data management in an SWfMS from a user perspective. In this paper, we present and evaluate a GUI version of such a novel approach to intermediate data management with two use cases (plant phenotyping and bioinformatics). The technique, which we call GUI-RISPTS (Recommending Intermediate States from Pipelines Considering Tool-States), can facilitate executions of workflows with processed data (i.e., intermediate outcomes of modules in a workflow) and can thus reduce the computational time of some modules in an SWfMS. We integrated GUI-RISPTS with an existing workflow management system called SciWorCS. In SciWorCS, we present an interface that users use for selecting the recommendation of intermediate states (i.e., modules' outcomes). We investigated GUI-RISPTS's effectiveness from users' perspectives along with measuring its overhead in terms of storage and efficiency in workflow execution.
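As a rough illustration of the underlying idea (intermediate-state reuse in general, not GUI-RISPTS itself): a module's output can be cached under a hash of its inputs and parameters, so a re-assembled workflow skips recomputing unchanged modules. All names below are invented for the sketch, and inputs are assumed JSON-serializable.

```python
# Hedged sketch of intermediate-data reuse in a workflow system: cache each
# module's outcome under a hash of its inputs/parameters and reuse it when the
# same module is re-executed. Not GUI-RISPTS itself; names are illustrative.
import hashlib
import json
import os
import pickle

CACHE_DIR = ".intermediate_states"  # assumed location for stored states

def cached_run(module, inputs: dict, params: dict):
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(json.dumps(
        {"module": module.__name__, "inputs": inputs, "params": params},
        sort_keys=True).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):              # intermediate state already stored
        with open(path, "rb") as f:
            return pickle.load(f)
    result = module(**inputs, **params)   # compute once, then persist
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```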
- Conference Article
5
- 10.1145/3035918.3058743
- May 9, 2017
Ever-growing data collections create the need for brief explorations of the available data, to extract relevant information before decision making becomes necessary. In this context of data exploration, current data analysis solutions struggle to quickly pinpoint useful information in data collections. One major reason is that loading data into a DBMS without knowing which part of it will actually be useful is a major bottleneck. To remove this bottleneck, state-of-the-art approaches perform queries in situ, thus avoiding the loading overhead. In situ query engines, however, are index-oblivious and lack sophisticated techniques to reduce the amount of data to be accessed. Furthermore, applications constantly generate fresh data and update the existing raw data files, whereas state-of-the-art in situ approaches support only append-like workloads. In this demonstration, we showcase the efficiency of adaptive indexing and partitioning techniques for analytical queries in the presence of updates. We demonstrate an online partitioning and indexing tuner for in situ querying that plugs into a query engine and offers support for fast queries over raw data files. We present Alpine, our prototype implementation, which combines the tuner with a query executor incorporating in situ query techniques to provide efficient raw data access. We visually demonstrate how Alpine incrementally and adaptively builds auxiliary data structures and indexes over raw data files, and how it adapts its behavior as a side effect of updates to the raw data files.
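A conceptual sketch of the simplest form this idea can take (not the Alpine implementation): build an index over a queried column lazily, as a side effect of the first query, and discard it when the raw file's modification time changes.

```python
# Conceptual sketch, not the Alpine system: an in situ column index built
# lazily on first access and invalidated whenever the raw file is updated.
import csv
import os
from collections import defaultdict

class AdaptiveColumnIndex:
    def __init__(self, path: str, col: int):
        self.path, self.col = path, col
        self.mtime = None      # file state the index was built against
        self.index = None      # value -> list of matching row numbers

    def rows_matching(self, value: str):
        mtime = os.path.getmtime(self.path)
        if self.index is None or mtime != self.mtime:  # stale/absent: rebuild
            self.index = defaultdict(list)
            with open(self.path, newline="") as f:
                for r, record in enumerate(csv.reader(f)):
                    self.index[record[self.col]].append(r)
            self.mtime = mtime
        return self.index[value]
```

A real system would repair the index incrementally rather than rebuild it, and would index only the value ranges that queries actually touch; the sketch shows only the query-driven, update-aware behavior.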
- Conference Article
54
- 10.1145/2457317.2457365
- Mar 18, 2013
Scientific workflows are commonly used to model and execute large-scale scientific experiments. They represent key resources for scientists and are enacted and managed by Scientific Workflow Management Systems (SWfMS). Each SWfMS has its particular approach to executing workflows and to capturing and managing their provenance data. Due to the large scale of experiments, it may be unviable to analyze provenance data only after the end of the execution: a single experiment may demand weeks to run, even in high performance computing environments. Scientists therefore need to monitor the experiment during its execution, and this can be done through provenance data. Runtime provenance analysis allows scientists to monitor workflow execution and to take actions before the end of it (i.e., workflow steering). This provenance data can also be used to fine-tune the parallel execution of the workflow dynamically. We use the PROV data model as a basic framework for modeling and providing runtime provenance as a database that can be queried even during the execution. This database is agnostic of the SWfMS and workflow engine. We show the benefits of representing and sharing runtime provenance data for improving experiment management as well as the analysis of the scientific data.
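For illustration only (the schema below is invented, not the paper's PROV-based one): with runtime provenance in a queryable database, monitoring becomes a polling query over still-running activities.

```python
# Hedged sketch: monitoring a running workflow by querying a runtime
# provenance database. The SQLite schema (task_execution with activity,
# start_time, end_time columns) is assumed for illustration.
import sqlite3
import time

conn = sqlite3.connect("runtime_prov.db")  # hypothetical provenance store
while True:
    rows = conn.execute("""
        SELECT activity,
               COUNT(*) AS running,
               AVG(strftime('%s','now') - strftime('%s', start_time)) AS avg_s
        FROM task_execution
        WHERE end_time IS NULL              -- activations still executing
        GROUP BY activity
    """).fetchall()
    for activity, running, avg_s in rows:
        print(f"{activity}: {running} running, {avg_s:.0f}s average elapsed")
    if not rows:
        break                               # nothing running: workflow done
    time.sleep(30)                          # poll during execution
conn.close()
```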
- Research Article
7
- 10.1177/019459989411000409
- Apr 1, 1994
- Otolaryngology–Head and Neck Surgery
Interlaboratory variability of rotational chair test results
- Dataset
- 10.21421/d2/hdeuku
- Jun 29, 2020
The VDSA panel dataset (vdsa.icrisat.ac.in) was generated by the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT) in partnership with Indian Council of Agricultural Research (ICAR) institutes and the International Rice Research Institute (IRRI). The VDSA operated over a total period of 40 years, from 1975 to 2015, but with discrete periods of data collection. In the most recent period (2009-2014), the period used for this analysis, data were collected for a larger number of households and with vastly increased survey efforts, focusing on detailed data collection covering production information, GPS-measured plots, and three-weekly household visits to record input and output data for each plot owned/leased by participants. The resultant dataset covers the period between 2009 and 2015, with 1,129 households participating from 30 villages in 9 states of India (vdsa.icrisat.ac.in/vdsa-map/vdsa-location-map.html). Study sites were selected using a stepwise purposive sampling strategy in order to cover the agro-ecological diversity of the region. The current dataset, based on the VDSA raw data, has been compiled to assess the relationship between farm size and agricultural productivity. The STATA program file (.do file) is shared along with the data. This program imports raw VDSA data and, with the necessary processing, develops the variables needed to run the models that study the relationship between agricultural productivity and plot size. The raw data files for the different modules can be downloaded from this dataset or generated from vdsakb.icrisat.ac.in (raw data option, selecting all available Indian states).
- Single Report
- 10.2172/782428
- Jun 23, 1999
A project to improve the Hanford Site's corrosion monitoring strategy was started in 1995. The project is designed to integrate EN-based corrosion monitoring into the site's corrosion monitoring strategy. In order to monitor multiple tanks, a major focus of this project has been to automate the data collection and analysis process. Data collection and analysis from the early EN corrosion monitoring equipment (241-AZ-101 and 241-AN-107) was primarily performed manually by a trained operator skilled in the analysis of EN data. Thousands of raw data files were collected, manually sorted, and stored. Further statistical analysis of these files was performed by manually stripping out data from thousands of raw data files and calculating statistics in a spreadsheet format. Plotting and other graphical display analyses were performed by manually exporting data from the data files or spreadsheet into another plotting or presentation software package. In 1999, an Amulet/PRP system was procured and employed on the 241-AN-102 corrosion monitoring system. A duplicate system was purchased for use on the upcoming 241-AN-105 system. A third system has been procured and will eventually be used to upgrade the 241-AN-107 system. The Amulet software has greatly improved the automation of waste tank EN data analysis. In contrast with previous systems, the Amulet operator no longer has to manually collect, sort, store, and analyze thousands of raw EN data files. Amulet writes all data to a single database. Statistical analysis, uniform corrosion rate, and other derived parameters are automatically calculated in Amulet from the raw data while the raw data are being collected. Other improvements in plotting and presentation make inspection of the data a much quicker and relatively easy task. These and other advances have greatly increased the speed at which EN data can be analyzed, in addition to improving the quality of the final interpretation. The increase in data automation offered by the Amulet software is necessary if multiple tanks are to be instrumented and analyzed at the Hanford Site. Although advances in the automation of data analysis have been great, Hanford EN data analysis still demands a highly trained corrosion expert. Neural networks could de-skill the post-data-collection analysis procedure and broaden the range of users able to understand and interpret corrosion data. Ultimately, the ability to de-skill the data analysis process will make or break the use of EN as a plant monitoring tool on a wide scale.