Abstract

An accurate cost model that accounts for dataset size and structure can help optimize geoscience data analysis. We develop and apply a computational model to estimate data analysis costs for arithmetic operations on gridded datasets typical of satellite and climate model output. For these dataset geometries our model predicts data reduction scalings that agree with measurements of widely used geoscience data processing software, the netCDF Operators (NCO). I/O performance and library design dominate throughput for simple analysis (e.g. dataset differencing). Dataset structure can reduce analysis throughput ten-fold relative to same-sized unstructured datasets. We demonstrate algorithmic optimizations which substantially increase throughput for more complex, arithmetic-dominated analysis such as weighted averaging of multi-dimensional data. These scaling properties can help to estimate the costs of distribution strategies for data reduction in cluster and grid environments.
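To make the weighted-averaging operation concrete, here is a minimal NumPy sketch of an area-weighted spatial mean over a gridded time-lat-lon field. The grid shape, cosine-latitude weights, and variable names are illustrative assumptions, not NCO's actual implementation.

```python
import numpy as np

# Hypothetical gridded field (time x lat x lon), as produced by
# satellite retrievals or climate models; values are random stand-ins.
nt, nlat, nlon = 12, 64, 128
rng = np.random.default_rng(0)
field = rng.random((nt, nlat, nlon))

# Area weights proportional to cos(latitude), a common choice for
# rectangular lat-lon grids (assumed here for illustration).
lat = np.linspace(-88.6, 88.6, nlat)
w = np.cos(np.deg2rad(lat))

# Weighted spatial mean per time step: sum(w * x) / sum(w),
# with the latitude weight broadcast across longitude.
wmean = (field * w[None, :, None]).sum(axis=(1, 2)) / (w.sum() * nlon)
print(wmean.shape)  # one value per time step
```

The broadcast multiply touches every grid point once, which is why such reductions become arithmetic-dominated as dataset rank and size grow.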

Highlights

  • Scientific advances in geosciences increasingly depend on large scale computing (e.g. NRC 2001; NSF 2003)

  • The solutions to these problems include seamless or virtual data grids (e.g. Foster et al 2002; Cornillon, Gallagher and Sgouros 2003) and middleware which optimizes the distribution of data analysis across the available computing resources (e.g. Woolf, Haines and Liu 2003; Chen and Agrawal 2004)

  • We are interested in data analysis optimization for geoscience datasets stored on rectangular grids rather than, for example, polygonal meshes common in GIS applications


Introduction

Scientific advances in the geosciences increasingly depend on large-scale computing (e.g. NRC 2001; NSF 2003). Analysis and post-processing of the resulting tera-scale geoscience datasets presents its own set of problems. Solutions to these problems include seamless or virtual data grids (e.g. Foster et al 2002; Cornillon, Gallagher and Sgouros 2003) and middleware that optimizes the distribution of data analysis across the available computing resources (e.g. Woolf, Haines and Liu 2003; Chen and Agrawal 2004). We are interested in data analysis optimization for geoscience datasets stored on rectangular grids rather than, for example, the polygonal meshes common in GIS applications. Rectangular datasets are well suited to parallel analysis because their mutually independent coordinates facilitate decomposition into smaller datasets of finer granularity, e.g. chunking (Li et al 2003; Drake, Jones and Carr 2005).
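The decomposition property described above can be sketched in a few lines of NumPy: because the coordinates of a rectangular grid are mutually independent, a reduction over the whole dataset equals the combination of the same reduction applied to chunks along one coordinate. The shapes and chunk size below are illustrative assumptions.

```python
import numpy as np

# Sketch: decompose a rectangular grid along its independent time
# coordinate and reduce each chunk separately (sizes are illustrative).
nt, nlat, nlon = 240, 32, 64
data = np.arange(nt * nlat * nlon, dtype=float).reshape(nt, nlat, nlon)

chunk = 60  # time steps per chunk; each chunk could go to a worker
partial_sums = [data[t:t + chunk].sum(axis=0) for t in range(0, nt, chunk)]

# Recombining the per-chunk reductions reproduces the global reduction,
# which is what makes the decomposition safe to distribute.
total = sum(partial_sums)
assert np.allclose(total, data.sum(axis=0))
```

Each chunk can be processed independently, so the same pattern underlies distribution of data reduction across cluster or grid resources.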

