Abstract

The goal of lossy data compression is to reduce the storage cost of a data set X while retaining as much information as possible about something (Y) that you care about. For example, what aspects of an image X contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping Z = g(X) that maximizes the mutual information I(Z, Y) while the entropy H(Z) is kept below some fixed threshold. We present a new method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable X (an image, say) drawn from a class Y ∈ {1, ..., n} can be distilled into a vector W = f(X) losslessly, so that I(W, Y) = I(X, Y); for example, for a binary classification task of cats and dogs, each image X is mapped into a single real number W retaining all information that helps distinguish cats from dogs. For the n = 2 case of binary classification, we then show how W can be further compressed into a discrete variable Z = g_β(W) ∈ {1, ..., m_β} by binning W into m_β bins, in such a way that varying the parameter β sweeps out the full Pareto frontier, solving a generalization of the discrete information bottleneck (DIB) problem. We argue that the most interesting points on this frontier are “corners” maximizing I(Z, Y) for a fixed number of bins M, which can conveniently be found without multiobjective optimization. We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm. We find that these Pareto frontiers are not concave, and that recently reported DIB phase transitions correspond to transitions between these corners, changing the number of clusters.
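The lossless distillation step can be illustrated with a minimal sketch (not the paper's implementation; the joint distribution below is made up for illustration): for binary classification, the conditional probability W = P(Y = 1 | X) is a sufficient statistic for Y, so mapping each x to this single number preserves all class information, I(W, Y) = I(X, Y), even when several x's collapse onto the same value of W.

```python
import numpy as np
from collections import defaultdict

def mutual_information(joint):
    """I(A, B) in bits from a joint probability table joint[a, b]."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

# Hypothetical joint distribution p(X, Y) over 6 "images" and 2 classes,
# built so that pairs of x's share the same conditional p(Y|X).
p_xy = np.array([[0.10, 0.05],   # w = 1/3
                 [0.20, 0.10],   # w = 1/3
                 [0.06, 0.12],   # w = 2/3
                 [0.03, 0.06],   # w = 2/3
                 [0.10, 0.10],   # w = 1/2
                 [0.04, 0.04]])  # w = 1/2

# Distill each x into the scalar W = P(Y = 1 | X = x).
w = p_xy[:, 1] / p_xy.sum(axis=1)

# Build the joint p(W, Y) by merging x's that share the same w:
# six x values collapse into three w values, with no loss of class information.
groups = defaultdict(lambda: np.zeros(2))
for x in range(len(p_xy)):
    groups[round(w[x], 12)] += p_xy[x]
p_wy = np.array(list(groups.values()))

print(mutual_information(p_xy), mutual_information(p_wy))  # equal
```

Because every x within a group has the identical conditional distribution p(Y | X = x), merging them cannot discard any information about Y, which is exactly the sense in which the distillation is lossless.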

Highlights

  • A core challenge in science, and in life quite generally, is data distillation: Keeping only a manageably small fraction of our available data X while retaining as much information as possible about something (Y) that we care about

  • For the n = 2 case of binary classification, we show how W can be further compressed into a discrete variable Z = g_β(W) ∈ {1, ..., m_β} by binning W into m_β bins, in such a way that varying the parameter β sweeps out the full Pareto frontier, solving a generalization of the discrete information bottleneck (DIB) problem

  • We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm. We find that these Pareto frontiers are not concave, and that recently reported DIB phase transitions correspond to transitions between these corners, changing the number of clusters

Introduction

A core challenge in science, and in life quite generally, is data distillation: keeping only a manageably small fraction of our available data X while retaining as much information as possible about something (Y) that we care about. The tradeoff between H∗ ≡ H(Z) (bits stored) and I∗ ≡ I(Z, Y) (useful bits) is described by a Pareto frontier, defined as I∗(H∗) ≡ sup I(Z, Y) over all compressions Z of X satisfying H(Z) ≤ H∗. In the accompanying figure, the colored dots correspond to random likelihood binnings into various numbers of bins, as described, and the upper envelope of all attainable points defines the Pareto frontier. Its “corners”, which are marked by black dots and maximize I(Z, Y) for M bins (M = 1, 2, ...), are seen to lie close to the vertical dashed lines H(Z) = log M, corresponding to all bins having equal size. The core goal of this paper is to present a method for computing such Pareto frontiers
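The corners can be sketched in a toy model (the W values and their probabilities below are made up, and the exhaustive search stands in for the paper's algorithm): take a discrete stand-in for W = P(Y = 1 | X), enumerate contiguous binnings of its sorted values into M bins, keep the binning maximizing I(Z, Y) for each M, and check that each corner satisfies H(Z) ≤ log₂ M.

```python
import numpy as np
from itertools import combinations

def entropy(p):
    """Shannon entropy in bits of a 1-D probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mi(joint):
    """I(A, B) = H(A) + H(B) - H(A, B) for a joint table joint[a, b]."""
    return entropy(joint.sum(axis=0)) + entropy(joint.sum(axis=1)) - entropy(joint.ravel())

# Toy joint p(W, Y): 8 equiprobable values of W = P(Y = 1 | X), sorted ascending.
p_w = np.full(8, 1 / 8)
w = np.array([0.05, 0.15, 0.30, 0.40, 0.60, 0.70, 0.85, 0.95])
p_wy = np.stack([p_w * (1 - w), p_w * w], axis=1)  # columns: y = 0, y = 1

# For each bin count M, search contiguous binnings of the sorted w values
# for the one maximizing I(Z, Y): a "corner" of the Pareto frontier.
corners = []
for M in range(1, 5):
    best = None
    for cuts in combinations(range(1, 8), M - 1):
        bounds = [0, *cuts, 8]
        p_zy = np.array([p_wy[a:b].sum(axis=0) for a, b in zip(bounds, bounds[1:])])
        cand = (mi(p_zy), entropy(p_zy.sum(axis=1)))  # (I(Z,Y), H(Z))
        if best is None or cand > best:
            best = cand
    assert best[1] <= np.log2(M) + 1e-12  # corners lie at H(Z) <= log M
    corners.append((M, best[1], best[0]))

for M, H, I in corners:
    print(f"M={M}: H(Z)={H:.3f} bits, I(Z,Y)={I:.3f} bits")
```

Splitting a bin can never decrease I(Z, Y), so the corner values of I(Z, Y) grow monotonically with M, tracing out successively higher points of the frontier.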
