Finite Blocklength Lossy Source Coding for Discrete Memoryless Sources

In this monograph, we review recent advances in second-order asymptotics for lossy source coding, which provides approximations to the finite blocklength performance of optimal codes. The monograph is divided into three parts. In Part I, we motivate the monograph, present basic definitions, introduce mathematical tools, and illustrate the motivation for non-asymptotic and second-order asymptotic analyses via the example of lossless source coding. In Part II, we first present existing results for the rate-distortion problem with proof sketches. Subsequently, we present five generalizations of the rate-distortion problem that address various aspects of practical quantization tasks: noisy source, noisy channel, mismatched code, Gauss-Markov source, and fixed-to-variable length compression. By presenting theoretical bounds for these settings, we illustrate the effect of noisy observation of the source, the influence of noisy transmission of the compressed information, the effect of using a fixed coding scheme for an arbitrary source, and the roles of source memory and variable rate. In Part III, we present four multiterminal generalizations of the rate-distortion problem that consider multiple encoders, decoders, or source sequences: the Kaspi problem, the successive refinement problem, the Fu-Yeung problem, and the Gray-Wyner problem. By presenting theoretical bounds for these multiterminal problems, we illustrate the role of side information, the optimality of stop-and-transmit, the effect of simultaneous lossless and lossy compression, and the tradeoff between encoders' rates in compressing correlated sources. Finally, we conclude the monograph, mention related results, and discuss future directions.
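To give a sense of the form such second-order approximations take, here is a minimal illustration (symbols as commonly defined in this literature, not quoted from the monograph; regularity conditions are omitted): for a discrete memoryless source, the minimum rate of a length-n code with target distortion d and excess-distortion probability epsilon behaves as

```latex
% R(d): rate-distortion function, V(d): rate-dispersion function,
% Q^{-1}: inverse of the Gaussian complementary CDF.
\[
  R(n, d, \epsilon) \;=\; R(d) \;+\; \sqrt{\frac{V(d)}{n}}\, Q^{-1}(\epsilon)
  \;+\; O\!\left(\frac{\log n}{n}\right).
\]
```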

Information-Theoretic Foundations of DNA Data Storage

Due to its longevity and enormous information density, DNA is an attractive medium for archival data storage. Thanks to rapid technological advances, DNA storage is becoming practically feasible, as demonstrated by a number of experimental storage systems, making it a promising solution for our society's increasing need for data storage. While DNA molecules in living things can consist of millions of nucleotides, technological constraints mean that, in practice, data is stored on many short DNA molecules, which are preserved in a DNA pool and cannot be spatially ordered. Moreover, imperfections in sequencing, synthesis, and handling, as well as DNA decay during storage, introduce random noise into the system, making the task of reliably storing and retrieving information in DNA challenging. This unique setup raises a natural information-theoretic question: how much information can be reliably stored on and reconstructed from millions of short noisy sequences? The goal of this monograph is to address this question by discussing the fundamental limits of storing information on DNA. Motivated by current technological constraints on DNA synthesis and sequencing, we propose a probabilistic channel model that captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered fashion; (2) the molecules are corrupted by noise; and (3) the data is read by randomly sampling from the DNA pool. Our goal is to investigate the impact of each of these key aspects on the capacity of the DNA storage system. Rather than focusing on coding-theoretic considerations and computationally efficient encoding and decoding, we aim to build an information-theoretic foundation for the analysis of these channels, developing tools for achievability and converse arguments.
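As a rough sketch of the kind of channel model in question (a toy simulation assuming i.i.d. substitution noise; the function name and parameter choices are illustrative and not taken from the monograph), the three distinctive aspects can be mimicked in a few lines:

```python
import random

def dna_storage_channel(molecules, num_reads, sub_prob, alphabet="ACGT"):
    """Toy model of a DNA storage read channel: (1) the stored molecules carry
    no order information, (2) each read suffers i.i.d. substitution noise, and
    (3) reads are drawn by sampling from the pool uniformly with replacement."""
    reads = []
    for _ in range(num_reads):
        mol = random.choice(molecules)  # random sampling from the unordered pool
        noisy = [random.choice([c for c in alphabet if c != x])
                 if random.random() < sub_prob else x
                 for x in mol]
        reads.append("".join(noisy))
    random.shuffle(reads)  # the output, too, carries no ordering
    return reads

# Example: 1000 molecules of 100 nucleotides each, read at coverage 5
pool = ["".join(random.choice("ACGT") for _ in range(100)) for _ in range(1000)]
reads = dna_storage_channel(pool, num_reads=5000, sub_prob=0.01)
```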

Modeling and Optimization of Latency in Erasure-coded Storage Systems

As consumers increasingly engage in social networking and E-commerce activities, businesses grow to rely on Big Data analytics for intelligence, and traditional IT infrastructures continue to migrate to the cloud and edge, demand for distributed data storage is rising at an unprecedented pace. Erasure coding has quickly emerged as a promising technique to reduce storage cost while providing reliability similar to that of replicated systems, and it has been widely adopted by companies such as Facebook, Microsoft, and Google. However, it also brings new challenges in characterizing and optimizing the access latency when erasure codes are used in distributed storage. The aim of this monograph is to provide a review of recent progress (both theoretical and practical) on systems that employ erasure codes for distributed storage. In this monograph, we will first identify the key challenges and present a taxonomy of the research problems, and then give an overview of different approaches that have been developed to quantify and model the latency of erasure-coded storage. This includes recent work leveraging MDS-Reservation, Fork-Join, Probabilistic, and Delayed-Relaunch scheduling policies, as well as their applications to characterizing the access latency (e.g., mean, tail, and asymptotic latency) of erasure-coded distributed storage systems. We will also extend the problem to the case in which users stream videos from erasure-coded distributed storage systems. Next, we bridge the gap between theory and practice and discuss lessons learned from prototype implementation. In particular, we will discuss exemplary implementations of erasure-coded storage, illuminate key design degrees of freedom and tradeoffs, and summarize remaining challenges in real-world storage systems, such as in content delivery and caching. Open problems for future research are discussed at the end of each chapter.
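To make the fork-join idea concrete, below is a minimal Monte Carlo sketch, assuming i.i.d. exponential chunk-download times and ignoring queueing delay; the function and its parameters are illustrative rather than drawn from the systems surveyed here:

```python
import random

def fork_join_latency(n, k, rate=1.0, trials=100_000):
    """Estimate mean access latency of an (n, k) MDS-coded object under
    fork-join scheduling: a read is forked to all n servers and completes
    once the fastest k chunk downloads finish (the k-th order statistic).
    Assumes i.i.d. exponential chunk service times and no queueing."""
    total = 0.0
    for _ in range(trials):
        times = sorted(random.expovariate(rate) for _ in range(n))
        total += times[k - 1]
    return total / trials

# Example: a (7, 4) code vs. reading a single replica of the whole object
print(fork_join_latency(7, 4))  # about H_7 - H_3 = 1/4 + 1/5 + 1/6 + 1/7 ~ 0.76
print(fork_join_latency(1, 1))  # about 1.0 for a single exponential(1) server
```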

Asymptotic Frame Theory for Analog Coding

Over-complete systems of vectors, or in short, frames, play the role of analog codes in many areas of communication and signal processing: to name a few, spreading sequences for code-division multiple access (CDMA), over-complete representations for multiple-description (MD) source coding, space-time codes, sensing matrices for compressed sensing (CS), and, more recently, codes for unreliable distributed computation. In this survey paper we observe an information-theoretic random-like behavior of frame subsets. Such sub-frames arise in setups involving erasures (communication), random user activity (multiple access), or sparsity (signal processing), in addition to channel or quantization noise. The goodness of a frame as an analog code is a function of the eigenvalues of a sub-frame, averaged over all sub-frames. For the highly symmetric class of Equiangular Tight Frames (ETF), as well as for other frames, we show that the empirical eigenvalue distribution of a randomly selected sub-frame (i) is asymptotically indistinguishable from Wachter's MANOVA distribution; and (ii) exhibits a universal convergence rate to this limit that is empirically indistinguishable from that of a matrix sequence drawn from MANOVA (Jacobi) ensembles of corresponding dimensions. Some of these results are shown via careful statistical analysis of empirical evidence, and some are proved analytically using random matrix theory arguments of independent interest. The goodness measures of the MANOVA limit distribution are better, in a concrete formal sense, than those of the Marchenko-Pastur distribution at the same aspect ratio, implying that deterministic analog codes are better than random (i.i.d.) analog codes. We further give evidence that the ETF (and near-ETF) family is in fact superior to any other frame family in terms of its typical sub-frame goodness.
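The central empirical object, the eigenvalue distribution of a randomly selected sub-frame, is easy to sketch in code; the snippet below uses an i.i.d. Gaussian frame as a stand-in (its sub-frame spectrum is close to Marchenko-Pastur, whereas an ETF of the same aspect ratio would give a spectrum close to Wachter's MANOVA law), and the names and dimensions are illustrative:

```python
import numpy as np

def subframe_spectrum(frame, subset_size, num_trials=500, seed=0):
    """Empirical eigenvalue distribution of randomly selected sub-frames.
    'frame' is an m x n matrix whose columns are (near) unit-norm frame
    vectors; each trial picks 'subset_size' columns uniformly at random
    (modeling erasures, random user activity, or sparsity) and records the
    eigenvalues of the Gram matrix of the resulting sub-frame."""
    rng = np.random.default_rng(seed)
    n = frame.shape[1]
    eigs = []
    for _ in range(num_trials):
        cols = rng.choice(n, size=subset_size, replace=False)
        sub = frame[:, cols]
        eigs.append(np.linalg.eigvalsh(sub.conj().T @ sub))
    return np.concatenate(eigs)

# Example: a 64 x 128 i.i.d. Gaussian "random analog code" with sub-frames of
# size 48 (a stand-in; an ETF would exhibit the MANOVA-like spectrum above).
m, n, k = 64, 128, 48
gaussian_frame = np.random.default_rng(1).normal(size=(m, n)) / np.sqrt(m)
spectrum = subframe_spectrum(gaussian_frame, subset_size=k)
```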
