Large-scale Scientific Data Research Articles

Computational workflows are a powerful paradigm to represent and manage complex applications, particularly in large-scale distributed scientific data analysis. Workflows represent application components that result in individual computations as well as their interdependences in terms of dataflow. Workflow systems use these representations to manage various aspects of workflow creation and execution for users, such as the automatic assignment of execution resources. This article describes an approach to automating a new aspect of the process: the selection of application components and data sources. We present a novel approach that enables users to specify varying degrees of detail and numbers of constraints in a workflow request, including the specification of constraints on input, intermediate, or output data in the workflow, abstract workflow component classes rather than specific component implementations, and generic reusable workflow templates that express a pre-defined combination of components. The algorithm elaborates the user request into a set of fully ground workflows with specific choices of data sources and codes to be used so that they can be submitted for mapping and execution. The algorithm searches through the space of possible candidate workflows by creating increasingly specialized versions of the original template and eliminating candidates that violate constraints accumulated in the candidate workflow as components and data sources are selected. A novel feature of our approach is that it assumes a distributed architecture where data and component catalogues are separate from the workflow system. The algorithm explicitly poses queries to external catalogues, and therefore any reasoning regarding data or component properties is not assumed to occur within the workflow system. We describe our implementation of this approach in the Wings workflow system. This implementation uses the W3C Web Ontology Language and associated reasoners to implement the workflow system as well as the data and component catalogues. This research demonstrates the use of artificial intelligence techniques to support the kinds of automation envisioned by the scientific community for large-scale distributed scientific data analysis.
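The search described in this abstract, specializing a template and discarding candidates that violate accumulated constraints, can be illustrated with a small sketch. This is not the Wings implementation and does not use the Web Ontology Language: the catalogue contents, the `query_components` and `query_data` functions, and the single format-compatibility constraint are all hypothetical stand-ins chosen for illustration.

```python
from itertools import product

# Hypothetical component catalogue: abstract class -> concrete implementations.
COMPONENT_CATALOGUE = {
    "Classifier": [
        {"name": "DecisionTree", "requires_format": "csv"},
        {"name": "NaiveBayes", "requires_format": "arff"},
    ],
}

# Hypothetical data catalogue: dataset records with simple metadata.
DATA_CATALOGUE = [
    {"name": "weather-2009", "format": "csv", "domain": "weather"},
    {"name": "weather-2010", "format": "arff", "domain": "weather"},
    {"name": "traffic-2010", "format": "csv", "domain": "traffic"},
]


def query_components(abstract_class):
    """Stand-in for a query posed to an external component catalogue."""
    return COMPONENT_CATALOGUE.get(abstract_class, [])


def query_data(constraints):
    """Stand-in for a query posed to an external data catalogue."""
    return [d for d in DATA_CATALOGUE
            if all(d.get(key) == value for key, value in constraints.items())]


def elaborate(template, data_constraints):
    """Expand a one-step template into fully specified candidate workflows."""
    candidates = []
    components = query_components(template["step"])
    datasets = query_data(data_constraints)
    for component, dataset in product(components, datasets):
        # Eliminate candidates whose choices violate a constraint:
        # here, the component must accept the dataset's format.
        if component["requires_format"] != dataset["format"]:
            continue
        candidates.append((component["name"], dataset["name"]))
    return candidates


# The request names an abstract component class and a constraint on input data.
request = {"step": "Classifier"}
workflows = elaborate(request, {"domain": "weather"})
print(workflows)
```

The workflow logic in `elaborate` only consumes query results; it never inspects catalogue internals, which mirrors the separation between the workflow system and the catalogues that the abstract emphasizes.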


We present a new framework for feature-based statistical analysis of large-scale scientific data and demonstrate its effectiveness by analyzing features from Direct Numerical Simulations (DNS) of turbulent combustion. Turbulent flows are ubiquitous and account for transport and mixing processes in combustion, astrophysics, fusion, and climate modeling, among other disciplines. They are also characterized by coherent structure or organized motion, i.e., nonlocal entities whose geometrical features can directly impact molecular mixing and reactive processes. While traditional multi-point statistics provide correlative information, they lack nonlocal structural information and hence fail to provide mechanistic causality information between organized fluid motion and mixing and reactive processes. It is therefore of great interest to capture and track flow features and their statistics together with their correlation with relevant scalar quantities, e.g., temperature or species concentrations. In our approach we encode the set of all possible flow features by pre-computing merge trees augmented with attributes, such as statistical moments of various scalar fields, e.g., temperature, as well as length-scales computed via spectral analysis. The computation is performed in an efficient streaming manner in a pre-processing step and results in a collection of meta-data that is orders of magnitude smaller than the original simulation data. This meta-data is sufficient to support a fully flexible and interactive analysis of the features, allowing for arbitrary thresholds, providing per-feature statistics, and creating various global diagnostics such as cumulative distribution functions (CDFs), histograms, or time-series. We combine the analysis with a rendering of the features in a linked-view browser that enables scientists to interactively explore, visualize, and analyze the equivalent of one terabyte of simulation data. We highlight the utility of this new framework for combustion science; however, it is applicable to many other science domains.
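To make the notions of threshold-defined features, per-feature statistics, and a global diagnostic concrete, here is a minimal sketch. It is not the framework's method: it thresholds a one-dimensional field directly instead of pre-computing an augmented merge tree, so it must rescan the raw data for every new threshold, which is exactly the cost the merge-tree meta-data is meant to avoid. The field names and values are made up.

```python
def extract_features(field, threshold):
    """Return runs of consecutive indices where field >= threshold."""
    features, current = [], []
    for index, value in enumerate(field):
        if value >= threshold:
            current.append(index)
        elif current:
            features.append(current)
            current = []
    if current:
        features.append(current)
    return features


def feature_means(features, scalar):
    """Mean of a second scalar field over each feature."""
    return [sum(scalar[i] for i in feature) / len(feature)
            for feature in features]


def empirical_cdf(values):
    """Pair each sorted value with the fraction of values <= it."""
    ordered = sorted(values)
    count = len(ordered)
    return [(value, (rank + 1) / count) for rank, value in enumerate(ordered)]


# Made-up 1D samples: a field used to define features, and temperature.
mixing = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1, 0.6, 0.9, 0.8, 0.3]
temperature = [300, 1200, 1400, 400, 1000, 350, 900, 1500, 1300, 500]

features = extract_features(mixing, 0.5)        # three separate features
means = feature_means(features, temperature)    # per-feature mean temperature
size_cdf = empirical_cdf([len(f) for f in features])  # global diagnostic
print(features, means, size_cdf)
```

Changing the threshold passed to `extract_features` changes which features exist; a merge tree augmented with attributes would answer the same query for any threshold from the pre-computed meta-data alone.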


Articles published on Large-scale Scientific Data

An exploration of SciDB in the context of emerging technologies for data stores in particle physics and cosmology

A semantic framework for automatic generation of computational workflows using distributed data and component catalogues

Feature-Based Statistical Analysis of Combustion Simulation Data

From social data mining to forecasting socio-economic crises.

Computation in large-scale scientific and internet data applications is a focus of MMDS 2010

Semantic enabled metadata management in PetaShare

Common Data Format Archiving of Large-Scale Intelligent Transportation Systems Data for Efficient Storage, Retrieval, and Portability

Grids, the TeraGrid and beyond

A scalable virtual environment for large scale scientific data analysis

Adding Intelligence to Scientific Data Management
