Abstract

The netCDF-4 format is widely used for large gridded scientific data sets and includes several compression methods: lossy linear scaling and the lossless deflate and shuffle algorithms. Many multidimensional geoscientific data sets exhibit considerable variation over one or several spatial dimensions (e.g., vertically) with less variation in the remaining dimensions (e.g., horizontally). On such data sets, linear scaling with a single pair of scale and offset parameters often entails considerable loss of precision. We introduce an alternative compression method called "layer-packing" that simultaneously exploits lossy linear scaling and lossless compression. Layer-packing stores arrays (instead of a scalar pair) of scale and offset parameters. An implementation of this method is compared with lossless compression, storage at fixed relative precision (bit-grooming), and scalar linear packing in terms of compression ratio, accuracy, and speed. When viewed as a trade-off between compression and error, layer-packing yields results similar to bit-grooming (storing between 3 and 4 significant figures). Bit-grooming and layer-packing offer significantly better control of precision than scalar linear packing. The relative performance, in terms of compression and errors, of bit-groomed and layer-packed data was strongly predicted by the entropy of the exponent array, while lossless compression was well predicted by the entropy of the original data array. Layer-packed data files must be "unpacked" to be readily usable. These compression and precision characteristics make layer-packing a competitive archive format for many scientific data sets.
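To make the contrast with scalar linear packing concrete, the Python sketch below packs each layer (along the first axis) of a 3-D array into 16-bit integers using that layer's own scale and offset pair, whereas scalar packing would use a single pair for the whole array. This is a minimal illustration of the idea under assumed conventions (function names, 16-bit integers, packing along axis 0), not the authors' implementation.

```python
import numpy as np

def layer_pack(data, nbits=16):
    """Pack each layer (along axis 0) of a float array into unsigned
    integers, using a separate scale/offset pair per layer (sketch only)."""
    nlev = 2**nbits - 1
    reduce_axes = tuple(range(1, data.ndim))
    offset = data.min(axis=reduce_axes)           # one offset per layer
    span = data.max(axis=reduce_axes) - offset    # one value range per layer
    scale = np.where(span > 0, span / nlev, 1.0)  # guard constant layers
    bshape = (-1,) + (1,) * (data.ndim - 1)       # broadcast over a layer
    packed = np.round((data - offset.reshape(bshape)) / scale.reshape(bshape))
    return packed.astype(np.uint16), scale, offset

def layer_unpack(packed, scale, offset):
    """Invert layer_pack; the result differs from the input by at most
    half a quantization step per layer."""
    bshape = (-1,) + (1,) * (packed.ndim - 1)
    return packed * scale.reshape(bshape) + offset.reshape(bshape)

# A field whose magnitude varies strongly across layers: with a single
# global scale/offset, the small-magnitude layers would lose precision.
field = np.exp(np.linspace(0, 10, 8))[:, None, None] * np.random.rand(8, 90, 180)
packed, scale, offset = layer_pack(field)
max_err = np.abs(layer_unpack(packed, scale, offset) - field).max()
```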

Highlights

  • The volume of both computational and observational geophysical data has grown dramatically in recent decades, and this trend is likely to continue

  • In all the results presented here, the chunk size was set equal to the size of the layers packed by layer-packing

  • This paper considers layer-packing, scalar linear packing and bit-grooming as a basis for compressing large gridded data sets (a simplified sketch of bit-grooming follows this list)
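Bit-grooming quantizes IEEE-754 floats by manipulating the trailing mantissa bits so that only a chosen number of significant figures is retained, which makes the trailing bits highly compressible by deflate. The sketch below shows only the simpler "bit-shave" half of the idea (zeroing trailing bits); full bit-grooming alternates zeroing and setting trailing bits across values to reduce statistical bias. The function name and bit budget are illustrative assumptions.

```python
import numpy as np

def bit_shave(data, keep_bits):
    """Zero the trailing (23 - keep_bits) mantissa bits of float32 values.
    This is the 'shave' half of bit-grooming; the full algorithm
    alternates zeroing and setting trailing bits to reduce bias."""
    bits = np.asarray(data, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

# Roughly 3.32 mantissa bits correspond to one decimal digit, so keeping
# about 10-13 explicit bits preserves the 3-4 significant figures the
# paper compares against layer-packing.
x = np.float32([3.14159265, 2.71828183])
groomed = bit_shave(x, 10)
```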


Introduction

The volume of both computational and observational geophysical data has grown dramatically in recent decades, and this trend is likely to continue. Two important sources of large volumes of data are computational modeling and remote sensing (principally from satellites). These data often have a "hypercube" structure and are stored in formats such as HDF5 (http://www.hdfgroup.org) and netCDF-4. Each of these formats has its own built-in compression techniques, allowing data to be stored in compressed form while remaining accessible (i.e., the compression/decompression algorithms are incorporated into the format's API). These compression methods are either "lossless" (i.e., no precision is lost) or "lossy" (i.e., some accuracy is lost).
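As a concrete illustration of compression built into the format's API, the netCDF4-python bindings expose the deflate and shuffle filters as keyword arguments when a variable is created; the file name, dimension sizes, and compression level below are arbitrary choices for the example.

```python
import numpy as np
from netCDF4 import Dataset  # netCDF4-python package

# Write a variable with the format's built-in lossless filters enabled;
# any netCDF-4-aware reader decompresses the data transparently.
with Dataset("example.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("lev", 8)
    nc.createDimension("lat", 90)
    nc.createDimension("lon", 180)
    var = nc.createVariable(
        "temperature", "f4", ("lev", "lat", "lon"),
        zlib=True,        # deflate (lossless)
        complevel=4,      # deflate level, 1-9
        shuffle=True,     # byte-shuffle filter, improves deflate ratios
    )
    var[:] = np.random.rand(8, 90, 180).astype("f4")
```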

