Abstract

With the increasing computing capabilities of modern supercomputers, the size of the data generated by scientific simulations is growing rapidly. As a result, application scientists need effective data summarization techniques that can reduce large-scale multivariate spatiotemporal data sets while preserving important data properties, so that the reduced data can answer domain-specific queries involving multiple variables with sufficient accuracy. When analyzing complex scientific events, domain experts often study and visualize two or more variables together to better understand the characteristics of data features. Data summarization techniques must therefore analyze multivariate relationships in detail and then perform data reduction such that the important features involving multiple variables are preserved in the reduced data. To achieve this, we propose a data sub-sampling algorithm for statistical data summarization that leverages pointwise information-theoretic measures to quantify the statistical association of data points across multiple variables and generates a sub-sampled data set that preserves this multivariate statistical association. Using such reduced sampled data, we show that multivariate feature queries and analyses can be performed effectively. The efficacy of the proposed multivariate-association-driven sampling algorithm is demonstrated by applying it to several scientific data sets.
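
As a rough illustration of the idea, the sketch below estimates per-point PMI from a joint histogram of two variables and draws a subsample whose selection probability is proportional to a shifted PMI weight. The function names (pmi_importance, pmi_subsample), the bin count, and the shift-and-normalize sampling rule are illustrative assumptions, not the exact algorithm proposed in the paper.

```python
# Minimal sketch of PMI-driven sub-sampling for two variables.
# Names, bin counts, and the sampling rule are illustrative assumptions.
import numpy as np

def pmi_importance(x, y, bins=64):
    """Per-point PMI(x_i, y_i) = log( p(x, y) / (p(x) p(y)) ), estimated from histograms."""
    joint, xe, ye = np.histogram2d(x, y, bins=bins)
    joint = joint / joint.sum()                       # joint probability p(x, y)
    px = joint.sum(axis=1)                            # marginal p(x)
    py = joint.sum(axis=0)                            # marginal p(y)
    xi = np.clip(np.digitize(x, xe[1:-1]), 0, bins - 1)   # bin index of each point in x
    yi = np.clip(np.digitize(y, ye[1:-1]), 0, bins - 1)   # bin index of each point in y
    eps = 1e-12                                       # avoid log(0) for empty bins
    return np.log((joint[xi, yi] + eps) / (px[xi] * py[yi] + eps))

def pmi_subsample(x, y, n_samples, rng=None):
    """Draw a subsample with selection probability proportional to shifted PMI."""
    rng = rng or np.random.default_rng(0)
    w = pmi_importance(x, y)
    w = w - w.min() + 1e-9                            # shift to strictly positive weights
    return rng.choice(len(x), size=n_samples, replace=False, p=w / w.sum())

# Usage example: keep 1% of points from two synthetic, correlated fields.
x = np.random.rand(100_000)
y = x + 0.1 * np.random.randn(100_000)
kept_indices = pmi_subsample(x, y, n_samples=1_000)
```

The shift-and-normalize step is only one way to turn signed PMI values into sampling probabilities; other weighting schemes could equally be used under the same general idea.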

Highlights

  • The size of scientific data sets is increasing rapidly with ever-growing computing capabilities; modern-day supercomputers can generate data on the order of petabytes, and we will soon enter the era of exascale computing [1,2].

  • We introduced the information-theoretic measure pointwise mutual information (PMI), which quantifies the statistical association of each data point but is applicable to two variables only.

  • We presented pointwise mutual information (PMI) and a generalized extension of it, which allows us to quantify the importance of each data point in terms of its statistical association across multiple variables (see the formulas sketched after this list).
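
For reference, the standard two-variable PMI is given below, together with one common pointwise multivariate generalization (sometimes called specific correlation); the multivariate form is stated here as an assumption and may differ from the exact extension developed in the paper.

```latex
% Two-variable PMI (standard definition).
\[
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
\]
% One common pointwise multivariate generalization (assumed form, not
% necessarily the paper's exact extension).
\[
\mathrm{PMI}(x_1, \dots, x_n) = \log \frac{p(x_1, \dots, x_n)}{\prod_{i=1}^{n} p(x_i)}
\]
```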

Introduction

The size of scientific data sets is increasing rapidly with ever-growing computing capabilities. Modern-day supercomputers can generate data on the order of petabytes, and we will soon enter the era of exascale computing [1,2]. As data sets keep growing, traditional analysis and visualization techniques that use full-resolution raw data will soon become prohibitive, since storing, parsing, and analyzing the full-resolution raw data will no longer be a viable option [3,4,5,6]. This is primarily due to the gap between the disk I/O speed and the data generation speed, which means that only a small subset of the data can be moved to permanent storage for exploratory post-hoc analysis.
