Abstract

From algorithmic information theory, which connects the information content of a data set to the shortest computer program that can produce it, it is known that there are strong analogies between compression, knowledge, inference and prediction. The more we know about a data generating process, the better we can predict and compress the data. A model that is inferred from data should ideally be a compact description of those data. In theory, this means that hydrological knowledge could be incorporated into compression algorithms to more efficiently compress hydrological data and to outperform general purpose compression algorithms. In this study, we develop such a hydrological data compressor, named HydroZIP, and test in practice whether it can outperform general purpose compression algorithms on hydrological data from 431 river basins from the Model Parameter Estimation Experiment (MOPEX) data set. HydroZIP compresses using temporal dependencies and parametric distributions. Resulting file sizes are interpreted as measures of information content, complexity and model adequacy. These results are discussed to illustrate points related to learning from data, overfitting and model complexity.
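
To make the link between temporal dependence and compression concrete: a coder that conditions each value on its predecessor needs at most the marginal entropy per value, and strictly fewer bits whenever the series is autocorrelated, since H(X_t | X_{t-1}) ≤ H(X_t). The following minimal Python sketch illustrates the gain on a synthetic AR(1) stand-in for streamflow with a hypothetical 16-level discretization; it is an illustration of the principle, not HydroZIP's actual encoding:

```python
import numpy as np

def entropy_bits(symbols):
    """Empirical Shannon entropy of a discrete sequence, in bits per symbol."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy_bits(symbols):
    """H(X_t | X_{t-1}) = H(X_{t-1}, X_t) - H(X_{t-1}): bits per symbol
    available to a coder that knows the previous value."""
    pairs = np.stack([symbols[:-1], symbols[1:]], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    joint = -np.sum(p * np.log2(p))
    return joint - entropy_bits(symbols[:-1])

# Synthetic autocorrelated stand-in for a streamflow series (AR(1));
# a real MOPEX series would replace this.
rng = np.random.default_rng(0)
x = np.zeros(3650)
for t in range(1, x.size):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()

# Discretize into 16 symbols (0..15) for entropy estimation.
edges = np.histogram_bin_edges(x, bins=16)
q = np.digitize(x, edges[1:-1])

print(f"marginal coding cost:    {entropy_bits(q):.2f} bits/value")
print(f"conditional on previous: {conditional_entropy_bits(q):.2f} bits/value")
```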

Highlights

  • Compression of hydrological data is important to efficiently store the increasing volumes of data [1], but it can also be used as a tool for learning about the internal dependence structure or patterns from those data [2,3,4] or to determine their information content [5].

  • When the model of the data is not known a priori, it needs to be stored with the data to have a full description that can be decoded to yield the original data. This extra description length reduces compression and acts as a natural penalization for model complexity. This penalization is reflected in many principles of algorithmic information theory (AIT), such as Kolmogorov complexity, algorithmic probability, and the minimum description length principle; see [14,15,16,17,18,19].

  • As benchmarks for the hydrological data compression algorithm we develop in this paper, we use results from a previous experiment [5] that used a selection of widely available compression algorithms.

Introduction

Compression of hydrological data is important to efficiently store the increasing volumes of data [1], but it can also be used as a tool for learning about the internal dependence structure or patterns from those data [2,3,4] or to determine their information content [5]. It is important to recognize the effect of prior knowledge on the information content of data. This effect can be intuitively understood and analyzed from the theoretical perspective of algorithmic information theory (AIT) and a related practical data compression framework. When the model of the data is not known a priori, it needs to be stored with the data to have a full description that can be decoded to yield the original data. This extra description length reduces compression and acts as a natural penalization for model complexity. This penalization is reflected in many principles of AIT, such as Kolmogorov complexity, algorithmic probability, and the minimum description length principle; see [14,15,16,17,18,19]. These principles are consistent with, and complementary to, the Bayesian framework for reasoning about models, data and predictions, or more generally the logic of science [15,20,21,22]. For a more elaborate and formal introduction and references on the link between data compression and AIT, the reader is referred to [25] in this same issue, or to [5] for a hydrologist’s perspective.
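
As a rough numerical illustration of this penalization (a sketch of the two-part-code idea under stated assumptions, not the encoding HydroZIP uses), the total description length L(model) + L(data | model) can be approximated by charging (k/2) log2 n bits for k fitted parameters, as in BIC-style MDL, plus the ideal code length of the data under the fitted distribution:

```python
import numpy as np
from scipy import stats

def two_part_bits(data, dist, params, k, bin_width=1.0):
    """Two-part description length in bits: a crude (k/2)*log2(n) charge for
    k fitted parameters (the model part) plus the ideal code length of the
    data under the fitted distribution, discretized to bins of `bin_width`."""
    n = data.size
    p = dist.cdf(data + bin_width, *params) - dist.cdf(data, *params)
    p = np.clip(p, 1e-12, 1.0)                # guard against zero-mass bins
    data_bits = -np.sum(np.log2(p))           # L(data | model)
    model_bits = 0.5 * k * np.log2(n)         # L(model): the complexity penalty
    return model_bits + data_bits

# Hypothetical decade of daily flows (mm/day); real MOPEX data would be used instead.
rng = np.random.default_rng(1)
flows = rng.gamma(shape=2.0, scale=1.5, size=3650)

for name, dist, k in [("exponential (1 param)", stats.expon, 1),
                      ("gamma (2 params)", stats.gamma, 2)]:
    params = dist.fit(flows, floc=0)          # fit with location fixed at zero
    bits = two_part_bits(np.floor(flows), dist, params, k)
    print(f"{name}: {bits:,.0f} bits")
# The two-parameter gamma wins only if its better fit shortens the data part
# by more than its extra 0.5*log2(n) bits of model cost.
```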
