Abstract

Lossy compression methods are extremely efficient in terms of space and performance and reduce the network bandwidth and disk space needed to store data arrays without reducing the number of stored values. Lossy compression applies an irreversible transformation to the data that reduces its information content. The transformation introduces a distortion, normally measured in terms of absolute or relative error; the error grows with the compression ratio. A good choice of lossy compression parameters maximizes the compression ratio while keeping the introduced error within acceptable margins. Negligence, or failure to choose the right compression method or its parameters, leads to a poor compression ratio or to loss of data.

A good strategy for lossy compression involves specifying the acceptable error margin and choosing the compression parameters and the storage format. We will discuss specific techniques of lossy compression and illustrate pitfalls in the choice of error margins and of tools for lossy/lossless compression. The following specific topics will be covered:

1. Packing of floating-point data to integers in NetCDF is sub-optimal in most cases, and for some quantities leads to severe errors (see the first sketch below).
2. Keeping relative vs. absolute precision: a false alternative.
3. The acceptable error margin depends on both the origin and the intended application of the data.
4. Smart algorithms for deciding on compression parameters have a limited area of applicability, which has to be considered in each individual case.
5. The choice of a format for compressed data (NetCDF, GRIB2, Zarr) is a tradeoff between size, speed, and precision.
6. What "number_of_significant_digits" and "least_significant_digit" mean in terms of relative/absolute error (see the second sketch below).
7. Bit-Shuffle is not always beneficial (see the third sketch below).
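To make the packing pitfall in item 1 concrete, here is a minimal sketch (our illustration, not code from the talk) of classic NetCDF scale_factor/add_offset packing into int16. Linear packing bounds the absolute error at about scale_factor/2 uniformly across the data range, so for a quantity spanning several orders of magnitude the relative error of the small values becomes severe:

```python
# Sketch of NetCDF-style linear packing of floats into int16.
# Function name and test data are illustrative assumptions.
import numpy as np

def pack_to_int16(data):
    """Map floats linearly onto the int16 range; return the packed
    integers plus the scale_factor/add_offset needed to unpack."""
    dmin, dmax = float(data.min()), float(data.max())
    scale_factor = (dmax - dmin) / (2**16 - 2)   # one code left for _FillValue
    add_offset = (dmax + dmin) / 2.0
    packed = np.round((data - add_offset) / scale_factor).astype(np.int16)
    return packed, scale_factor, add_offset

# A field spanning ~4 orders of magnitude, e.g. specific humidity:
data = np.concatenate([np.random.uniform(1e-6, 1e-5, 500),
                       np.random.uniform(1e-3, 1e-2, 500)])
packed, sf, ao = pack_to_int16(data)
unpacked = packed * sf + ao
abs_err = np.abs(unpacked - data)
print("max absolute error:", abs_err.max())           # ~scale_factor / 2
print("max relative error:", (abs_err / data).max())  # several percent for small values
```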
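For item 6, the distinction is that "least significant digit" quantization bounds the absolute error at roughly 0.5 * 10**-d for digit d, whereas "number of significant digits" quantization (e.g. NSD-style precision-preserving compression) keeps leading digits and therefore bounds the relative error. A minimal sketch, assuming the netCDF4-python API and an illustrative file name:

```python
# Sketch: absolute-error quantization via netCDF4-python's
# least_significant_digit keyword of createVariable.
from netCDF4 import Dataset
import numpy as np

with Dataset("quantized.nc", "w") as nc:
    nc.createDimension("x", 10000)
    # least_significant_digit=1: data are rounded so that the first
    # decimal place survives, i.e. absolute error <= ~0.5 * 10**-1 for
    # every value regardless of its magnitude; the zeroed trailing
    # mantissa bits then compress well under zlib.
    v = nc.createVariable("t2m", "f4", ("x",), zlib=True,
                          least_significant_digit=1)
    v[:] = 273.15 + 30.0 * np.random.random(10000)
```

Keeping a fixed number of significant digits instead lets the absolute error scale with the magnitude of each value, which suits quantities such as humidity or concentrations that span decades.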
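For item 7, Bit-Shuffle transposes the bits of consecutive values so that equal bit planes become contiguous, which helps general-purpose compressors on some data and hurts on others; the only reliable check is to measure. A minimal sketch, assuming the Zarr v2 API with the numcodecs Blosc codec (the smooth test field is an illustrative assumption):

```python
# Sketch: comparing no shuffle / byte shuffle / bit shuffle
# on the same array and reporting the stored sizes.
import numpy as np
import zarr
from numcodecs import Blosc

x = np.linspace(0.0, 2.0 * np.pi, 1024, dtype="f4")
data = np.sin(x[:, None]) * np.cos(x[None, :])

for name, shuffle in [("no shuffle  ", Blosc.NOSHUFFLE),
                      ("byte shuffle", Blosc.SHUFFLE),
                      ("bit shuffle ", Blosc.BITSHUFFLE)]:
    z = zarr.array(data, compressor=Blosc(cname="zstd", clevel=5, shuffle=shuffle))
    print(name, z.nbytes_stored, "bytes stored")
```

Which shuffle variant wins depends on the dtype, any prior quantization, and the chunking, so the comparison should be repeated on representative data rather than assumed.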
