Exploratory Analysis of Raw Data Files through Dataflows

Abstract

Scientific applications generate raw data files at very large scale. Most of these files follow a standard format established by the application domain, such as HDF5, NetCDF, and FITS, and these formats are supported by a variety of programming languages, libraries, and programs. Because the files are so large, analyzing them requires writing a specific program. Generic data analysis systems such as database management systems (DBMS) are not well suited, because loading and transforming the data at large scale is too costly. Recently there have been several proposals, such as NoDB, RAW, and FastBit, for indexing and querying raw data files without the overhead of using a DBMS. Their goal is to offer query support over the raw data file after a scientific program has generated it. However, these solutions focus on the analysis of one single large file. When a large number of related files are all required to evaluate one scientific hypothesis, the relationships among them must be managed manually or by writing specific programs. The proposed approach takes advantage of existing provenance support from Scientific Workflow Management Systems (SWfMS). When scientific applications are managed by a SWfMS, the data is registered in the provenance database at runtime, so this provenance data can act as a description of these files. When the SWfMS is dataflow-aware, it registers all the domain data in the same database. The resulting database becomes an important access method to the large number of files generated by the scientific workflow execution, and it is complementary to single raw data file analysis support. In this work, we present our dataflow approach for analyzing data from several raw data files and evaluate it with the Montage application from the astronomy domain.
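The access pattern the abstract describes can be sketched with a minimal provenance database. The schema below (tables `raw_file` and `domain_data`, the `mProject`/`mAdd` activity names, and the attribute values) is hypothetical and purely illustrative of how domain data registered at runtime lets one query across many raw files at once:

```python
import sqlite3

# Hypothetical provenance schema: each raw file produced by a workflow
# activity is registered together with domain attributes extracted from it.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_file(id INTEGER PRIMARY KEY, path TEXT, activity TEXT);
CREATE TABLE domain_data(file_id INTEGER, attribute TEXT, value REAL,
                         FOREIGN KEY(file_id) REFERENCES raw_file(id));
""")
con.executemany("INSERT INTO raw_file VALUES (?,?,?)",
                [(1, "m101_1.fits", "mProject"),
                 (2, "m101_2.fits", "mProject"),
                 (3, "mosaic.fits", "mAdd")])
con.executemany("INSERT INTO domain_data VALUES (?,?,?)",
                [(1, "crval1", 210.80), (2, "crval1", 210.85),
                 (3, "crval1", 210.82)])

# One query spans every file produced by the mProject activity, instead of
# opening and parsing each raw file individually.
rows = con.execute("""
SELECT f.path, d.value FROM raw_file f
JOIN domain_data d ON d.file_id = f.id
WHERE f.activity = 'mProject' AND d.attribute = 'crval1'
ORDER BY f.id
""").fetchall()
print(rows)
```

The point is that the provenance database, not the file system, becomes the access method: a join replaces a hand-written file-crawling program.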

Similar Papers
  • Research Article
  • Cited by 16
  • 10.1002/cpe.3616
Analyzing related raw data files through dataflows
  • Aug 4, 2015
  • Concurrency and Computation: Practice and Experience
  • Vítor Silva + 3 more

Summary: Computer simulations may ingest and generate large numbers of raw data files. Most of these files follow a de facto standard format established by the application domain, for example, Flexible Image Transport System (FITS) for astronomy. Although these formats are supported by a variety of programming languages, libraries, and programs, analyzing thousands or millions of files requires developing specific programs. Database management systems (DBMS) are not suited for this, because they require loading the raw data and structuring it, which becomes heavy at large scale. Systems like NoDB, RAW, and FastBit have been proposed to index and query raw data files without the overhead of using a DBMS. However, these solutions focus on analyzing one single large file instead of several related files. When related files are produced and required for analysis, the relationships among elements within file contents must be managed manually, with specific programs to access the raw data; this data management may be time-consuming and error-prone. When computer simulations are managed by a scientific workflow management system (SWfMS), they can take advantage of provenance data to relate and analyze raw data files produced during workflow execution. However, a SWfMS registers provenance at a coarse grain, with limited analysis of elements from raw data files. When the SWfMS is dataflow-aware, it can register provenance data and the relationships among elements of raw data files together in a database, which is useful for accessing the contents of a large number of files. In this paper, we propose a dataflow approach for analyzing element data from several related raw data files. Our approach is complementary to existing single raw data file analysis approaches. We use the Montage workflow from astronomy and a workflow from the oil and gas domain as data-intensive case studies. Our experimental results for the Montage workflow explore different types of raw dataflow queries, such as showing all linear transformations involved in projection simulation programs, considering specific mosaic elements from input repositories. The cost of raw data extraction is approximately 3.7% of the total application execution time. Copyright © 2015 John Wiley & Sons, Ltd.

  • Research Article
  • Cited by 21
  • 10.1016/j.future.2017.01.016
Raw data queries during data-intensive parallel workflow execution
  • Jan 11, 2017
  • Future Generation Computer Systems
  • Vítor Silva + 6 more


  • Research Article
  • Cited by 61
  • 10.1002/cpe.1636
A data dependency based strategy for intermediate data storage in scientific cloud workflow systems
  • Aug 27, 2010
  • Concurrency and Computation: Practice and Experience
  • Dong Yuan + 4 more

Summary: Many scientific workflows are data-intensive, generating large volumes of intermediate data during their execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system storage capacity, determined manually. As doing science in the cloud has become popular, more intermediate data can be stored in scientific cloud workflows based on a pay-for-use model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate data can be regenerated, and on this basis we develop a novel intermediate data storage strategy that can reduce the cost of scientific cloud workflow systems by automatically storing appropriate intermediate data sets with one cloud service provider. The strategy has significant research merit: it achieves a cost-effective trade-off between computation cost and storage cost, and it is not strongly affected by inaccuracy in forecasting data set usage. The strategy also takes the users' tolerance of data-access delay into consideration. We use Amazon's cost model and apply the strategy to general random workflows as well as a specific astrophysics pulsar-searching scientific workflow for evaluation. The results show that our strategy can significantly reduce the overall cost of scientific cloud workflow execution. Copyright © 2010 John Wiley & Sons, Ltd.
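The core trade-off this strategy makes can be illustrated with a toy decision rule: keep an intermediate data set only while storing it is cheaper than regenerating it on demand. All prices, rates, and thresholds below are invented for illustration; they are not Amazon's actual cost model or the paper's full algorithm (which also traverses the IDG and models access-delay tolerance):

```python
# Toy intermediate-data storage decision. All numbers are illustrative.
STORAGE_PER_GB_MONTH = 0.10   # $ to keep 1 GB stored for a month
COMPUTE_PER_HOUR = 0.50       # $ per compute hour to regenerate data

def should_store(size_gb, regen_hours, uses_per_month):
    """Compare monthly storage cost with the expected regeneration cost."""
    storage_cost = size_gb * STORAGE_PER_GB_MONTH
    regen_cost = regen_hours * COMPUTE_PER_HOUR * uses_per_month
    return regen_cost > storage_cost

# Large, cheap to recompute, rarely used: delete it and regenerate on demand.
print(should_store(size_gb=500, regen_hours=0.1, uses_per_month=1))   # False
# Small, expensive to regenerate, frequently used: keep it stored.
print(should_store(size_gb=1, regen_hours=4, uses_per_month=3))       # True
```

The IDG's role in the actual strategy is to make the deletion side of this rule safe: because data lineage is recorded, a deleted data set can always be rebuilt from its stored ancestors.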

  • Research Article
  • 10.5075/epfl-thesis-6644
Adaptive Query Processing on Raw Data Files
  • Jan 1, 2015
  • Ioannis Alagiannis

Nowadays, business and scientific applications accumulate data at an increasing pace. This growth of information has already started to outgrow the capabilities of database management systems (DBMS). In a typical DBMS usage scenario, the user must define a schema, load the data, and tune the system for an expected workload before submitting any queries. Copying data into a database is a significant investment in terms of time and resources, and in many cases unnecessary or even no longer feasible in practice due to the explosive data growth. Additionally, the way a DBMS stores and organizes data during loading determines how data will be accessed for a given workload, and thus the maximum performance. Selecting the underlying data layout (row-store or column-store) is a critical first tuning decision that cannot be changed later. Query analysis, however, is not static; it evolves as queries change, so static design decisions can be suboptimal. In this thesis, we advocate in situ query processing as the principal way to manage data in a database. We reconsider the data loading phase and redesign traditional query processing architectures to work efficiently over raw data files, addressing the heavy initialization cost that comes with data loading. We present adaptive data loading as an alternative to traditional full a priori loading, and we explore the potential of in situ query processing in the context of current DBMS architectures. We identify performance bottlenecks specific to in situ processing and introduce an adaptive indexing mechanism (the positional map) that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure and techniques for collecting statistics over raw data files. Moreover, we design a flexible query engine that is not built around a single storage layout but can exploit different storage layouts and execution strategies in a single engine. It decides during query processing which design fits the input queries and adapts the underlying data storage accordingly. By applying code generation techniques, we dynamically generate access operators tailored to specific classes of queries. This thesis revises the traditional paradigm of loading, tuning, and then querying, using in situ query processing as the principal way to minimize data-to-query time. We show that raw data files should not be considered "outside" the DBMS and that full data loading should not be a requirement for exploiting database technology. On the contrary, techniques specifically tailored to overcome the limitations of accessing raw data files can eliminate the data loading overhead, thereby making raw data files first-class citizens, fully integrated with the query engine. The proposed roadmap can provide guidance on how to convert any traditional DBMS into an efficient in situ query engine.
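The positional map idea can be sketched over a raw CSV file: on the first scan, record the byte offset where each row starts, so later queries seek straight to the bytes they need instead of re-tokenizing the whole file. The file contents and parsing details here are simplified assumptions, not NoDB's actual implementation (which also tracks intra-row field positions):

```python
import io

# Stand-in for an on-disk raw CSV file opened in binary mode.
raw = b"id,name,mass\n1,proton,938.3\n2,neutron,939.6\n3,electron,0.511\n"
f = io.BytesIO(raw)

# First pass: build a positional map of row-start byte offsets (header skipped).
f.readline()  # skip the header line
positions = []
while True:
    off = f.tell()
    if not f.readline():
        break
    positions.append(off)

def fetch_row(n):
    """Seek directly to row n via the positional map; parse only that row."""
    f.seek(positions[n])
    return f.readline().decode().rstrip("\n").split(",")

print(fetch_row(2))  # reaches the third data row without rescanning the file
```

The map is built incrementally as a by-product of the first query that touches the file, which is what lets in situ engines amortize parsing cost across a workload instead of paying it all up front at load time.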

  • Research Article
  • Cited by 235
  • 10.1007/s10723-015-9329-8
A Survey of Data-Intensive Scientific Workflow Management
  • Mar 8, 2015
  • Journal of Grid Computing
  • Ji Liu + 3 more

Nowadays, more and more computer-based scientific experiments need to handle massive amounts of data. Their data processing consists of multiple computational steps with dependencies among them, and a data-intensive scientific workflow is useful for modeling such a process. Since the sequential execution of data-intensive scientific workflows may take much time, Scientific Workflow Management Systems (SWfMS) should enable their parallel execution and exploit resources distributed across different infrastructures such as grid and cloud. This paper provides a survey of data-intensive scientific workflow management in SWfMS and their parallelization techniques. Based on a SWfMS functional architecture, we give a comparative analysis of the existing solutions. Finally, we identify research issues for improving the execution of data-intensive scientific workflows in a multisite cloud.

  • Conference Article
  • Cited by 3
  • 10.1145/2457317.2457379
Provenance traces from Chiron parallel workflow engine
  • Mar 18, 2013
  • Felipe Horta + 7 more

Scientific workflows are commonly used to model and execute large-scale scientific experiments. They represent key resources for scientists and are managed by Scientific Workflow Management Systems (SWfMS). The different languages used by SWfMS may impact the way the workflow engine executes the workflow, sometimes limiting optimization opportunities. To tackle this issue, we recently proposed a scientific workflow algebra [1]. This algebra is inspired by relational algebra from databases and enables automatic optimization of scientific workflows to be executed in parallel in high performance computing (HPC) environments. Accordingly, the experiments presented in this paper were executed in Chiron, a parallel scientific workflow engine implemented to support the scientific workflow algebra. Before executing the workflow, Chiron stores its prospective provenance [2] in the provenance database. Each workflow is composed of several activities, and each activity consumes relations. As in relational databases, a relation contains a set of attributes and is composed of a set of tuples; each tuple contains a series of values, each associated with a specific attribute. The tuples of a relation are distributed to be consumed in parallel over the computing resources according to the workflow activity. During and after the execution, the retrospective provenance [2] is also stored.
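The relational view described above can be sketched as follows: an activity behaves like a Map operator that consumes one tuple of an input relation and emits one output tuple, and since tuples are independent, the engine is free to distribute them across workers. The relation contents and the activity function are invented for illustration and are not Chiron's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# An input relation: a set of tuples, each modeled as a dict of attributes.
relation = [{"img": f"p{i}.fits", "scale": i * 0.5} for i in range(1, 5)]

def activity(t):
    """A Map-style activity: consumes one tuple, emits one tuple."""
    return {"img": t["img"], "scaled": t["scale"] * 2}

# Tuples are independent, so the engine may consume them in parallel;
# a thread pool stands in here for distributed computing resources.
with ThreadPoolExecutor(max_workers=4) as pool:
    output_relation = list(pool.map(activity, relation))

print(output_relation[0])
```

Casting activities as algebraic operators over relations is what opens the door to database-style optimization: the engine can reorder, fuse, or repartition operators without changing the workflow's result.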

  • Conference Article
  • Cited by 54
  • 10.1145/2457317.2457365
Capturing and querying workflow runtime provenance with PROV
  • Mar 18, 2013
  • Flavio Costa + 6 more

Scientific workflows are commonly used to model and execute large-scale scientific experiments. They represent key resources for scientists and are enacted and managed by Scientific Workflow Management Systems (SWfMS). Each SWfMS has its particular approach to executing workflows and to capturing and managing their provenance data. Due to the large scale of experiments, it may be infeasible to analyze provenance data only after the execution ends; a single experiment may take weeks to run, even in high performance computing environments. Scientists therefore need to monitor the experiment during its execution, and this can be done through provenance data. Runtime provenance analysis allows scientists to monitor workflow execution and to take actions before it ends (i.e., workflow steering). This provenance data can also be used to fine-tune the parallel execution of the workflow dynamically. We use the PROV data model as a basic framework for modeling and providing runtime provenance as a database that can be queried even during the execution. This database is agnostic of the SWfMS and workflow engine. We show the benefits of representing and sharing runtime provenance data for improving experiment management as well as the analysis of the scientific data.
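The monitoring use case can be sketched with a few PROV-style activity records: each record carries start and end timestamps, and an end of `None` marks an activity execution that is still running. The record structure and field names are simplified assumptions, not the PROV data model's full vocabulary:

```python
# PROV-style retrospective provenance captured at runtime: each record is
# one activity execution; "end" is None while the execution is in flight.
provenance = [
    {"activity": "mProject", "task": 1, "start": "10:00:00", "end": "10:02:10"},
    {"activity": "mProject", "task": 2, "start": "10:00:00", "end": None},
    {"activity": "mAdd",     "task": 3, "start": None,       "end": None},
]

def running(prov):
    """A monitoring query issued *during* execution: started but unfinished."""
    return [r for r in prov if r["start"] is not None and r["end"] is None]

print([r["task"] for r in running(provenance)])
```

Queries like this one, issued mid-run against the provenance database, are what enable workflow steering: a scientist can spot a stalled task and act before the whole experiment finishes.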

  • Conference Article
  • Cited by 10
  • 10.1145/2443416.2443418
Evaluating parameter sweep workflows in high performance computing
  • May 20, 2012
  • Fernando Chirigati + 7 more

Scientific experiments based on computer simulations can be defined, executed, and monitored using Scientific Workflow Management Systems (SWfMS). Several SWfMS are available, each with a different goal and a different engine. Due to the exploratory nature of the analysis, scientists need to run parameter sweep (PS) workflows, which are workflows invoked repeatedly with different input data. These workflows generate a large number of tasks that are submitted to High Performance Computing (HPC) environments. Different execution models for a workflow may show significant performance differences in HPC, yet selecting the best execution model for a given workflow is difficult because many characteristics of the workflow may affect its parallel execution. We developed a study to show the performance impact of different execution models when running PS workflows in HPC. Our study contributes a characterization of PS workflow patterns (the basis for many existing scientific workflows) and their behavior under different execution models in HPC. We evaluated four execution models for running workflows in parallel, measuring the performance behavior of small, large, and complex workflows under each. The results can be used as a guideline to select the best model for a given scientific workflow execution in HPC. Our evaluation may also serve as a basis for workflow designers to analyze the expected behavior of an HPC workflow engine based on the characteristics of PS workflows.
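The task explosion that PS workflows cause can be sketched directly: the sweep is the Cartesian product of the parameter values, and each combination becomes one independent task for the HPC queue. The parameter names and values are invented for illustration:

```python
from itertools import product

# A parameter sweep invokes the same workflow once per parameter combination.
# Parameter names and values are illustrative only.
sweep = {"viscosity": [0.01, 0.1], "grid": [64, 128], "dt": [0.001]}

# Cartesian product of the value lists -> one task per combination.
tasks = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]

print(len(tasks))   # 2 * 2 * 1 combinations -> 4 independent tasks
print(tasks[0])
```

Because the task count is the product of the value-list sizes, adding one more swept parameter multiplies the workload, which is why the choice of execution model matters so much at scale.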

  • Research Article
  • Cited by 13
  • 10.1007/s10586-019-02920-6
Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach
  • Mar 9, 2019
  • Cluster Computing
  • Thaylon Guedes + 4 more

Scientific workflows are abstractions composed of activities, data, and dependencies that model a computer simulation and are managed by complex engines named Scientific Workflow Management Systems (SWfMS). Many workflows demand substantial computational resources, since their execution may involve a number of different programs processing a massive volume of data. Thus, the use of high-performance computing (HPC) and data-intensive scalable computing environments, allied to parallelization techniques, provides the necessary support for executing such workflows. Clouds already offer HPC capabilities that workflows can exploit. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment, so existing SWfMS must be fault-tolerant. There are several fault tolerance techniques used in SWfMS, such as checkpoint/restart, re-execution, and over-provisioning, but it is far from trivial to choose a fault tolerance technique that will not jeopardize the parallel execution. The major problem is that the suitable technique may differ for each workflow, activity, or activation, since the programs associated with activities may behave differently. This article analyzes several fault tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user's workflow activities and activations using machine learning techniques and provenance data, thus aiming to improve resiliency.
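The shape of such a recommender can be sketched with a rule-based stand-in: pick a technique per activity from statistics that would come out of the provenance database. The thresholds and rules below are invented placeholders for the paper's learned model, and the statistics are fabricated for the example:

```python
# Toy provenance-driven recommender. Thresholds and rules are illustrative
# stand-ins for a model trained on provenance data, not the paper's method.
def recommend(failure_rate, mean_task_minutes):
    if failure_rate < 0.01:
        return "re-execution"        # failures rare: just retry on demand
    if mean_task_minutes > 60:
        return "checkpoint/restart"  # long tasks: avoid losing hours of work
    return "over-provisioning"       # short but failure-prone: run replicas

# These statistics would be computed from past executions in the
# provenance database, one pair per workflow activity.
print(recommend(failure_rate=0.002, mean_task_minutes=5))
print(recommend(failure_rate=0.05, mean_task_minutes=240))
```

The key idea the sketch preserves is per-activity granularity: because provenance records behavior for each activity and activation separately, the recommendation can differ within a single workflow.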

  • Research Article
  • Cited by 29
  • 10.1016/j.jss.2010.10.027
A novel general framework for automatic and cost-effective handling of recoverable temporal violations in scientific workflow systems
  • Oct 30, 2010
  • Journal of Systems and Software
  • Xiao Liu + 5 more


  • Conference Article
  • Cited by 2
  • 10.1109/dcabes.2010.73
Distributed Management of Scientific Workflows in SWIMS
  • Aug 1, 2010
  • Mahmoud El-Gayyar + 2 more

Scientific workflows are emerging as a dominant approach for scientists to assemble highly specialized applications and to exchange large heterogeneous datasets in order to automate complex scientific tasks. Several Scientific Workflow Management Systems (SWfMS) have already been designed to support the execution and monitoring of scientific workflows. Even so, additional requirements and challenges must still be met in order to provide a fully distributed and efficient SWfMS. The SWIMS (Scientific Workflow Management and Integration System) environment has been developed to examine the nature of these challenges and to accommodate the missing requirements. In this paper we highlight these requirements and show how workflow management in SWIMS fulfills them.

  • Book Chapter
  • Cited by 2
  • 10.1007/978-3-319-73353-1_23
Eeny Meeny Miny Moe: Choosing the Fault Tolerance Technique for my Cloud Workflow
  • Dec 28, 2017
  • Leonardo Araújo De Jesus + 2 more

Scientific workflows are models composed of activities, data, and dependencies whose objective is to represent a computer simulation. Workflows are managed by Scientific Workflow Management Systems (SWfMS). Such workflows commonly demand many computational resources, since their execution may involve a number of different programs processing a huge volume of data. Thus, the use of High Performance Computing (HPC) environments, allied to parallelization techniques, provides the support for executing such experiments, and some resources provided by clouds can be used to build HPC environments. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility, so SWfMS must be fault-tolerant. There are several fault tolerance techniques used in SWfMS, such as checkpoint/restart and replication, but which technique best fits a specific workflow? This work analyzes several fault tolerance techniques in SWfMS and recommends the suitable one for the user's workflow using machine learning techniques and provenance data, thus improving resiliency.

  • Conference Article
  • Cited by 30
  • 10.1145/1646468.1646470
Exploring many task computing in scientific workflows
  • Nov 16, 2009
  • Eduardo Ogasawara + 7 more

One of the main advantages of using a scientific workflow management system (SWfMS) to orchestrate data flows among scientific activities is to control and register the whole workflow execution. Executing activities of a workflow with high performance computing (HPC) presents challenges for SWfMS execution control. Current solutions leave the scheduling to the HPC queue system, and since the workflow execution engine does not run on remote clusters, the SWfMS is not aware of the parallel strategy of the workflow execution. Consequently, remote execution control and provenance registry of the parallel activities are very limited on the SWfMS side. This work presents a set of components to be included in the workflow specification of any SWfMS to control the parallelization of activities as many-task computing (MTC). In addition, these components can gather provenance data during remote workflow execution. Through these MTC components, the parallelization strategy can be registered and reused, and provenance data can be uniformly queried. We evaluated our approach by performing parameter sweep parallelization in solving the incompressible 3D Navier-Stokes equations. Experimental results show the performance gains, with the additional benefit of distributed provenance support.

  • Conference Article
  • Cited by 69
  • 10.1109/services-i.2009.18
Towards a Taxonomy of Provenance in Scientific Workflow Management Systems
  • Jul 1, 2009
  • Sérgio Manuel Serra Da Cruz + 2 more

Scientific workflow management systems (SWfMS) have been helping scientists to prototype and execute in silico experiments, systematically collecting provenance information about derived data products so it can later be queried. Despite the efforts toward a standard Open Provenance Model (OPM), provenance remains tightly coupled to each SWfMS; scientific workflow provenance concepts, representations, and mechanisms are therefore very heterogeneous, difficult to integrate, and dependent on the SWfMS. To help compare, integrate, and analyze scientific workflow provenance, this paper presents a taxonomy of provenance characteristics. Its classification enables computer scientists to distinguish between different perspectives on provenance and guides them toward a better understanding of provenance data in general. The analysis of existing approaches will assist us in managing provenance data from distributed, heterogeneous workflow executions.

  • Book Chapter
  • Cited by 7
  • 10.1007/978-3-642-17819-1_28
GExpLine: A Tool for Supporting Experiment Composition
  • Jan 1, 2010
  • Daniel De Oliveira + 5 more

Scientific experiments present several advantages when modeled at high abstraction levels, independent of Scientific Workflow Management System (SWfMS) specification languages. For example, the scientist can define the scientific hypothesis in terms of algorithms and methods; this high-level experiment can then be mapped into different scientific workflow instances, which can be executed by a SWfMS and take advantage of its provenance records. However, each workflow execution is often treated by the SWfMS as an independent instance, and there are no tools that allow modeling the conceptual experiment and linking it to the diverse workflow execution instances. This work presents GExpLine, a tool for supporting experiment composition through provenance. In an analogy to software development, it can be seen as a CASE tool, while a SWfMS can be seen as an IDE. It provides a conceptual representation of the scientific experiment and automatically associates workflow executions with the concept of an experiment. Using prospective provenance from the experiment, GExpLine generates corresponding workflows that can be executed by a SWfMS. This paper also presents a real experiment use case that reinforces the importance of GExpLine and its prospective provenance support.
