Abstract
As the computing capabilities of modern supercomputers increase, the size of the data generated by scientific simulations is growing rapidly. As a result, application scientists need effective data summarization techniques that can reduce large-scale multivariate spatiotemporal data sets while preserving the important data properties, so that the reduced data can answer domain-specific queries involving multiple variables with sufficient accuracy. When studying complex scientific events, domain experts often analyze and visualize two or more variables together to better understand the characteristics of the data features. Data summarization techniques therefore need to analyze multivariable relationships in detail and then reduce the data such that the important features involving multiple variables are preserved. To achieve this, we propose a data sub-sampling algorithm for statistical data summarization. The algorithm leverages pointwise information theoretic measures to quantify the statistical association of each data point across multiple variables, and it generates sub-sampled data that preserves the statistical association among those variables. Using the reduced sampled data, we show that multivariate feature query and analysis can be done effectively. We demonstrate the efficacy of the proposed multivariate association driven sampling algorithm by applying it to several scientific data sets.
Highlights
The size of scientific data sets is increasing rapidly with ever-increasing computing capabilities. Modern-day supercomputers can generate data on the order of petabytes, and soon we will enter the era of exascale computing [1,2].
We introduced the information theoretic measure pointwise mutual information (PMI), which quantifies the statistical association of each data point but applies to two variables only.
We presented pointwise mutual information (PMI) and a generalized extension of it, which together allow us to quantify the importance of each data point in terms of its statistical association across multiple variables.
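To make the two-variable case concrete, the sketch below estimates PMI per data point using the standard definition PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), and then draws a sub-sample whose selection probability grows with PMI. Several things here are assumptions for illustration and are not taken from the paper: the histogram-based probability estimates, the bin count, the shift that turns PMI values into non-negative weights, and the function names `pointwise_mutual_information` and `association_driven_sample`. The paper's actual sampling scheme and its multivariate generalization may differ.

```python
import numpy as np

def pointwise_mutual_information(x, y, bins=32):
    """Estimate PMI for every (x[i], y[i]) pair from a joint histogram."""
    joint, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()   # joint probability per bin
    p_x = p_xy.sum(axis=1)       # marginal over y
    p_y = p_xy.sum(axis=0)       # marginal over x
    # Bin index of every data point; clip so the maximum value
    # falls into the last bin, as np.histogram2d does.
    ix = np.clip(np.digitize(x, x_edges) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(y, y_edges) - 1, 0, bins - 1)
    # Each point lies in an occupied bin, so the ratio is finite and positive.
    return np.log(p_xy[ix, iy] / (p_x[ix] * p_y[iy]))

def association_driven_sample(x, y, fraction=0.1, bins=32, seed=0):
    """Return indices of a sub-sample biased toward high-PMI points."""
    rng = np.random.default_rng(seed)
    pmi = pointwise_mutual_information(x, y, bins)
    # Assumption: shift PMI so all weights are positive, then normalize.
    weights = pmi - pmi.min() + 1e-12
    weights /= weights.sum()
    n_keep = max(1, int(fraction * x.size))
    return rng.choice(x.size, size=n_keep, replace=False, p=weights)

# Usage on two synthetic, strongly associated variables.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = x + 0.1 * rng.normal(size=5000)
kept = association_driven_sample(x, y, fraction=0.1)
print(kept.size, "of", x.size, "points kept")
```

Averaging the per-point PMI values over all data points recovers the histogram estimate of the mutual information between the two variables, which is one way to sanity-check the estimate.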
Summary
The size of scientific data sets is increasing rapidly with ever-increasing computing capabilities. Modern-day supercomputers can generate data on the order of petabytes, and soon we will enter the era of exascale computing [1,2]. As data sets keep growing, traditional analysis and visualization techniques that use full resolution raw data will soon become prohibitive, since storing, parsing, and analyzing the full resolution raw data will no longer be a viable option [3,4,5,6]. This is primarily due to the gap between the disk I/O speed and the data generation speed. Only a small subset of the data can be moved to permanent storage for exploratory post-hoc analysis.