A range of computing problems require understanding uncertainty, such as climate modeling, or use statistics to model problems that are otherwise difficult to solve, such as high-energy particle collisions. Currently, these computations are handled by hardware in which significant energy is spent to suppress stochasticity in materials and devices, and then significant computational resources are expended to reintroduce stochasticity in algorithms. Instead, we take inspiration from the brain, which features 10^15 stochastic synapses. This talk focuses on understanding how to leverage fluctuations in devices to do efficient sampling, which is a fundamental operation in many statistical approaches to computation [1]. In the first part of this talk, we use bitstreams generated by magnetic tunnel junction and tunnel diode devices to generate samples from different distributions, an elementary operation in statistical approaches to modeling. We show how to use elementary operations on multiple bits to improve both the accuracy and complexity of sampling. While intuition motivates asking which of the two devices is more efficient at generating a random bitstream, in the second part of this talk we show that this consideration is a small contribution to the overall circuit required to do a complete calculation. Significantly accelerating applications requires devices that minimize the energy and area cost of the CMOS parts of the design, as the true challenge lies in holistic codesign [2].

At the heart of statistical approaches to computation is sampling. Typically, a uniform random sample is generated using a pseudo-random number generator (PRNG), followed by a mathematical operation to sample an application-relevant distribution. This sample is then plugged into a sequence of deterministic calculations that comprise a model.
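The conventional pipeline described above (PRNG, then a transformation, then a deterministic model) can be sketched as follows. This is only an illustration under stated assumptions: the exponential target distribution, the inverse-transform step, and the placeholder `toy_model` are hypothetical choices for the example and are not taken from the talk.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse-transform sampling: map a uniform sample to an exponential one."""
    u = rng.random()  # uniform sample in [0, 1) from a software PRNG
    # Inverse CDF of the exponential distribution; 1 - u lies in (0, 1].
    return -math.log(1.0 - u) / rate


def toy_model(x):
    """Hypothetical stand-in for the deterministic calculations of a model."""
    return x * x


rng = random.Random(0)  # seeded PRNG so the run is repeatable
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # should be close to 1 / rate = 0.5
outputs = [toy_model(x) for x in samples]
```

Every sample here costs one PRNG call plus a logarithm and a division, which is the software baseline that hardware coinflip devices would aim to replace.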
We use magnetic tunnel junctions and tunnel diodes to generate a fair coinflip, having equal probability of each of two outcomes, from which we create a uniform random sample. We show how to relate the quality of the device bitstreams to the quality of random samples [3]. While these devices consume significantly less energy than the PRNG, the energy consumed by the PRNG is a fraction of the energy consumed by a complete calculation. We next focus on moving the entire process of sampling non-uniform distributions into hardware. We show that weighted coinflips can be used to sample any distribution with a well-defined probability distribution function using a tree. Simple logic operations can be used to combine many inaccurate fair physical coinflips to produce a single high-accuracy weighted logical coinflip, and in turn high-quality samples from non-uniform distributions. Overall, a simple implementation uses a few hundred physical coinflips, some simple logic operations (shift register, comparison, XOR extractor), and memory access to produce a sample. This success points to the potential for moving more of the model into increasingly sophisticated sampling schemes.

Having established how the basic elements of the computation are connected, we are now ready to examine its efficiency compared to a PRNG, whose cost is roughly nJ/operation. Thus far, we have ignored the analog signal transduction attached to the coinflip devices and the logic operations that tie them together. Coinflip devices that cannot be directly integrated with logic will suffer from a von Neumann bottleneck. Thus, the coinflip devices need to be intimately integrated with logic, which requires paying a per-device transduction penalty. We find that the energy cost for even the simplest signal transduction (a stimulating pulse and a latching output) is in the range of 100 fJ to 1 pJ per device, which is larger than the energy consumed by either magnetic tunnel junctions or tunnel diodes.
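The tree-based sampling scheme described above can be illustrated with a small software sketch. Everything in it is an assumption made for the example rather than the authors' implementation: the device is simulated with Python's `random` module, the bias `p_one=0.52`, the XOR depth `k=4`, the 8-bit `PRECISION`, the 3-bit target `pmf`, and all function names are hypothetical. XOR-ing several bits is used as one simple extractor, and each weighted coinflip is formed by shifting in extracted bits and comparing the result with a stored integer weight.

```python
import random

PRECISION = 8  # bits used to represent each stored weight (assumed)


def raw_bit(rng, p_one=0.52):
    """Stand-in for one physical coinflip; p_one models a slightly biased device."""
    return 1 if rng.random() < p_one else 0


def xor_extract(rng, k=4):
    """XOR k raw bits to suppress bias (assumes the raw bits are independent)."""
    bit = 0
    for _ in range(k):
        bit ^= raw_bit(rng)
    return bit


def weighted_coinflip(weight, rng):
    """Return 1 with probability of about weight / 2**PRECISION."""
    r = 0
    for _ in range(PRECISION):
        r = (r << 1) | xor_extract(rng)  # shift register of extracted bits
    return 1 if r < weight else 0  # comparison against the stored weight


def build_tree(pmf):
    """Store a quantized P(next bit = 1 | prefix) for every prefix of the sample.

    len(pmf) must be a power of two; outcomes are indexed most significant bit first.
    """
    n_bits = (len(pmf) - 1).bit_length()
    tree = {}
    for depth in range(n_bits):
        span = len(pmf) >> depth  # number of outcomes below each node
        for prefix in range(1 << depth):
            lo = prefix * span
            total = sum(pmf[lo:lo + span])
            upper = sum(pmf[lo + span // 2:lo + span])
            p1 = upper / total if total > 0 else 0.0
            tree[(depth, prefix)] = round(p1 * (1 << PRECISION))
    return tree, n_bits


def sample(tree, n_bits, rng):
    """Walk the tree from the root, one weighted coinflip per output bit."""
    prefix = 0
    for depth in range(n_bits):
        bit = weighted_coinflip(tree[(depth, prefix)], rng)
        prefix = (prefix << 1) | bit
    return prefix


pmf = [0.05, 0.10, 0.20, 0.30, 0.15, 0.10, 0.07, 0.03]  # illustrative target
tree, n_bits = build_tree(pmf)
rng = random.Random(1)
draws = [sample(tree, n_bits, rng) for _ in range(20_000)]
```

In this sketch an n-bit sample needs only n weighted coinflips, but the tree stores 2^n - 1 weights (7 for the 3-bit example). Each sample consumes n × `PRECISION` × k raw coinflips, which is 96 with the values assumed here.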
Meanwhile, the most expensive component of the logic operations stems from the fine-grained integration of memory. While only 32 weighted coinflips are needed to draw a 32-bit non-uniform sample, there are ~2^32 possible weights needed. We discuss schemes to prune the values needed, to avoid the situation where the weights occupy all of a 1 cm^2 die. In sum, the energy and area costs of the coinflip device are less important than the energy cost of signal transduction and the area taken by memory for storing the weights of a sampling tree.

We emphasize that none of our estimates represents a tight optimization of any part of an imagined probabilistic computer. Still, in a full-system implementation, the efficient management of signal transduction and memory clearly outweighs the energy or space consumed by the coinflip devices themselves. These factors comprise the key variables for stochastic devices that need to be accounted for in holistic codesign for probabilistic computing.

SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.