Abstract

The netCDF-4 format is widely used for large gridded scientific data sets and includes several compression methods: lossy linear scaling and the lossless deflate and shuffle algorithms. Many multidimensional geoscientific data sets exhibit considerable variation over one or several spatial dimensions (e.g., vertically) with less variation in the remaining dimensions (e.g., horizontally). On such data sets, linear scaling with a single pair of scale and offset parameters often entails considerable loss of precision. We introduce an alternative compression method called "layer-packing" that simultaneously exploits lossy linear scaling and lossless compression. Layer-packing stores arrays (instead of a scalar pair) of scale and offset parameters. An implementation of this method is compared with lossless compression, storage at fixed relative precision (bit-grooming), and scalar linear packing in terms of compression ratio, accuracy, and speed. When viewed as a trade-off between compression and error, layer-packing yields results similar to bit-grooming (storing between 3 and 4 significant figures). Bit-grooming and layer-packing offer significantly better control of precision than scalar linear packing. The relative performance, in terms of compression and errors, of bit-groomed and layer-packed data was strongly predicted by the entropy of the exponent array, while lossless compression was well predicted by the entropy of the original data array. Layer-packed data files must be "unpacked" to be readily usable. These compression and precision characteristics make layer-packing a competitive archive format for many scientific data sets.
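To make the contrast with scalar linear packing concrete, the Python sketch below packs each layer (along the first axis) of a 3-D array into 16-bit integers using that layer's own scale and offset pair, whereas scalar packing would use a single pair for the whole array. This is a minimal illustration of the idea under assumed conventions (function names, 16-bit integers, packing along axis 0), not the authors' implementation.

```python
import numpy as np

def layer_pack(data, nbits=16):
    """Pack each layer (along axis 0) of a float array into unsigned
    integers, using a separate scale/offset pair per layer (sketch only)."""
    nlev = 2**nbits - 1
    reduce_axes = tuple(range(1, data.ndim))
    offset = data.min(axis=reduce_axes)           # one offset per layer
    span = data.max(axis=reduce_axes) - offset    # one value range per layer
    scale = np.where(span > 0, span / nlev, 1.0)  # guard constant layers
    bshape = (-1,) + (1,) * (data.ndim - 1)       # broadcast over a layer
    packed = np.round((data - offset.reshape(bshape)) / scale.reshape(bshape))
    return packed.astype(np.uint16), scale, offset

def layer_unpack(packed, scale, offset):
    """Invert layer_pack; the result differs from the input by at most
    half a quantization step per layer."""
    bshape = (-1,) + (1,) * (packed.ndim - 1)
    return packed * scale.reshape(bshape) + offset.reshape(bshape)

# A field whose magnitude varies strongly across layers: with a single
# global scale/offset, the small-magnitude layers would lose precision.
field = np.exp(np.linspace(0, 10, 8))[:, None, None] * np.random.rand(8, 90, 180)
packed, scale, offset = layer_pack(field)
max_err = np.abs(layer_unpack(packed, scale, offset) - field).max()
```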

Highlights

  • The volume of both computational and observational geophysical data has grown dramatically in recent decades, and this trend is likely to continue

  • In all the results presented here, the chunk size was set equal to the size of the layers packed by layer-packing

  • This paper considers layer-packing, scalar linear packing and bit-grooming as a basis for compressing large gridded data sets (a simplified sketch of bit-grooming follows this list)
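Bit-grooming quantizes IEEE-754 floats by manipulating the trailing mantissa bits so that only a chosen number of significant figures is retained, which makes the trailing bits highly compressible by deflate. The sketch below shows only the simpler "bit-shave" half of the idea (zeroing trailing bits); full bit-grooming alternates zeroing and setting trailing bits across values to reduce statistical bias. The function name and bit budget are illustrative assumptions.

```python
import numpy as np

def bit_shave(data, keep_bits):
    """Zero the trailing (23 - keep_bits) mantissa bits of float32 values.
    This is the 'shave' half of bit-grooming; the full algorithm
    alternates zeroing and setting trailing bits to reduce bias."""
    bits = np.asarray(data, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

# Roughly 3.32 mantissa bits correspond to one decimal digit, so keeping
# about 10-13 explicit bits preserves the 3-4 significant figures the
# paper compares against layer-packing.
x = np.float32([3.14159265, 2.71828183])
groomed = bit_shave(x, 10)
```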


Introduction

The volume of both computational and observational geophysical data has grown dramatically in recent decades, and this trend is likely to continue. Two important sources of large volumes of data are computational modeling and remote sensing (principally from satellites). These data often have a "hypercube" structure and are stored in formats such as HDF5 (http://www.hdfgroup.org) and netCDF-4. Each of these formats has its own built-in compression techniques, allowing data to be stored in compressed form while remaining accessible (i.e., the compression/decompression algorithms are incorporated into the format's API). These compression methods are either "lossless" (i.e., no precision is lost) or "lossy" (i.e., some accuracy is lost).
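As a concrete illustration of compression built into the format's API, the netCDF4-python bindings expose the deflate and shuffle filters as keyword arguments when a variable is created; the file name, dimension sizes, and compression level below are arbitrary choices for the example.

```python
import numpy as np
from netCDF4 import Dataset  # netCDF4-python package

# Write a variable with the format's built-in lossless filters enabled;
# any netCDF-4-aware reader decompresses the data transparently.
with Dataset("example.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("lev", 8)
    nc.createDimension("lat", 90)
    nc.createDimension("lon", 180)
    var = nc.createVariable(
        "temperature", "f4", ("lev", "lat", "lon"),
        zlib=True,        # deflate (lossless)
        complevel=4,      # deflate level, 1-9
        shuffle=True,     # byte-shuffle filter, improves deflate ratios
    )
    var[:] = np.random.rand(8, 90, 180).astype("f4")
```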

