Compressing Tabular Data via Pairwise Dependencies.

Dmitri S Pavlichin, Tsachy Weissman, Amir Ingber

doi:10.1109/dcc.2017.82

Open Access

https://doi.org/10.1109/dcc.2017.82

Abstract

We propose a method and algorithm for lossless compression of tabular data, such as machine learning datasets, server logs, and genomic datasets. Superior compression ratios are achieved by exploiting dependencies between the fields (or features) of the dataset. The algorithm compresses the records with respect to a probabilistic graphical model, specifically an optimized forest in which each feature is a node. The work extends the Chow-Liu tree method by adding to the cost function a more accurate correction term, which accounts for the size required to describe the model itself. The algorithm also codes the metadata (such as probability distributions) efficiently and relabels the data in order to cope with large datasets and alphabets. We test the algorithm on several datasets and demonstrate compression rates that improve on gzip by a factor of between 2 and 5. The larger improvements are observed for very large datasets, such as the Criteo click prediction dataset, which was published as part of a recent Kaggle competition.
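The forest-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the model-cost penalty is an assumed BIC-style stand-in (about half of log2 n bits per extra parameter of a conditional table), since the abstract does not give the paper's exact correction term. Each pair of columns is scored by the bits saved through conditioning (n times the empirical mutual information) minus the assumed cost of describing the extra parameters, and a maximum-weight forest is built from the pairs whose net gain is positive.

```python
import math
from collections import Counter
from itertools import combinations


def entropy_bits(column):
    """Empirical entropy of one column, in bits per record."""
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in Counter(column).values())


def mutual_information_bits(x, y):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy_bits(x) + entropy_bits(y) - entropy_bits(list(zip(x, y)))


def chow_liu_forest(columns, bits_per_parameter=None):
    """Return the edges (i, j) of a penalized maximum-weight forest.

    columns: list of equal-length lists, one per feature.
    bits_per_parameter: ASSUMED model cost per extra parameter; defaults
    to a BIC-style 0.5 * log2(n). This is not the paper's correction term.
    """
    n = len(columns[0])
    if bits_per_parameter is None:
        bits_per_parameter = 0.5 * math.log2(n)
    alphabet_sizes = [len(set(col)) for col in columns]

    candidates = []
    for i, j in combinations(range(len(columns)), 2):
        # Bits saved over the whole table by coding one column given the other.
        saving = n * mutual_information_bits(columns[i], columns[j])
        # Extra parameters of a conditional table compared with a marginal.
        extra_params = (alphabet_sizes[i] - 1) * (alphabet_sizes[j] - 1)
        gain = saving - bits_per_parameter * extra_params
        if gain > 0:  # unprofitable edges are dropped, so the result is a forest
            candidates.append((gain, i, j))

    # Kruskal's algorithm on descending net gain, with union-find.
    parent = list(range(len(columns)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edges = []
    for gain, i, j in sorted(candidates, reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding the edge does not create a cycle
            parent[root_i] = root_j
            edges.append((i, j))
    return edges
```

With three columns where the second is a relabeling of the first and the third is independent of both, the sketch keeps only the edge between the first two columns. A compressor would then code each column conditioned on its parent in the forest, and code root columns from their marginal distributions.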
