As the exponential performance-per-watt gains from traditional CMOS scaling come to an end, new computing paradigms are being seriously considered. Neural network algorithms such as deep neural networks (DNNs) are well matched to the analog, in-memory computing paradigm, with the potential for orders-of-magnitude improvement over today's special-purpose chips. In this analog paradigm, neural network weights are stored as conductance levels in a physical matrix of tunable nonvolatile resistance memory devices, such as oxide-based resistive memory (ReRAM), phase change memory (PCRAM), three-terminal floating-gate memory, and redox memories. Our detailed analysis shows that a neural processing unit (NPU) analog core based on these devices can perform individual inference and training operations at energies as low as 10 fJ per operation (100 TOPS/W), a greater than 100x improvement over the best modern digital neural training processors. This would deliver revolutionary performance for deep neural network accelerators, which are now found in computing from smartphones to datacenters.

The improvements originate from the efficient mapping of key inference and training operations directly onto the physical memory array. The multiply-accumulate operation is performed physically: Ohm's law multiplies each input voltage by a weight conductance, and Kirchhoff's current law accumulates the resulting currents along each column (Fig. 1). The enormous energy expense of moving data between an arithmetic logic unit and a register several times per training operation is thereby reduced by a factor roughly proportional to the number of rows (or columns) of the neural network weight matrix (typically >100).

Analog array benefits have been well established for the inference stage of neural networks and are now under early commercial development. Harnessing analog arrays for neural network training addresses a significantly more energy-intensive task than inference alone, but analog training remains elusive due to device-level challenges. Training places electrical constraints on the analog memory far beyond those for inference-only accelerators. Foremost among these challenges is obtaining sufficient "linearity" and "symmetry" in analog memory devices. Conductance increments during a weight update must occur without feedback to verify the correct conductance change, as such an iterative process would negate the energy and latency advantages over digital CMOS. This open-loop weight update scheme requires that the analog memory change its conductance by a predictable amount, linearly proportional to the number of write pulses and regardless of the initial state; this property is known as device "linearity." The degree of linearity often differs between positive and negative conductance increments, indicating poor device "symmetry." Furthermore, devices in an array must have significantly overlapping conductance ranges over many cycles. If the analog memory elements do not have sufficient linearity and symmetry, the neural network will suffer unacceptable accuracy loss during training. This remains an area of active research, with several strong two- and three-terminal candidates. Sandia has developed a characterization methodology and modeling code, CrossSim, which allows training accuracy to be evaluated for experimental devices.
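To make the crossbar mapping and the open-loop update scheme concrete, the following minimal NumPy sketch illustrates both. It is illustrative only, not CrossSim or a hardware model: the conductance range and step parameters are assumed values, and the "soft-bounds" update rule is just one common way to model state-dependent (nonlinear, asymmetric) conductance increments.

import numpy as np

# Assumed, illustrative device parameters (not measured values)
G_MIN, G_MAX = 1e-6, 1e-4  # programmable conductance range, in siemens

def mac_column_currents(G, v_in):
    """One analog multiply-accumulate read of a crossbar.

    Ohm's law performs each multiply (i_ij = G_ij * v_i) and
    Kirchhoff's current law sums the currents flowing down each
    column, so a single read yields a full vector-matrix product.
    """
    return G.T @ v_in  # column currents, shape (n_cols,)

def open_loop_update(g, n_pulses, alpha_p=2e-6, alpha_d=2e-6):
    """Blind (no read-verify) write pulses applied to one device.

    n_pulses > 0 potentiates, n_pulses < 0 depresses.  The
    state-dependent "soft-bounds" step models nonlinearity, and
    alpha_p != alpha_d would model asymmetry; an ideal device
    would instead change by a constant, sign-symmetric step.
    """
    for _ in range(abs(n_pulses)):
        if n_pulses > 0:   # step shrinks as g approaches G_MAX
            g += alpha_p * (G_MAX - g) / (G_MAX - G_MIN)
        else:              # step shrinks as g approaches G_MIN
            g -= alpha_d * (g - G_MIN) / (G_MAX - G_MIN)
    return float(np.clip(g, G_MIN, G_MAX))

rng = np.random.default_rng(0)
G = rng.uniform(G_MIN, G_MAX, size=(128, 64))  # 128x64 weight array
v = rng.uniform(0.0, 0.2, size=128)            # row read voltages
currents = mac_column_currents(G, v)           # 64 accumulated column currents

Because no read-verify step appears in open_loop_update, any state dependence of the step size translates directly into weight-update error during training, which is exactly the linearity and symmetry problem described above.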
Two-terminal emerging memories, including oxide-based resistive memory (ReRAM) and phase change memory (PCRAM), are very dense and capable of high-speed switching. However, typical TaOx ReRAM and GST PCRAM have been evaluated and found to suffer unacceptable accuracy losses due to the nonlinearity and asymmetry inherent to these devices. In ReRAM the most likely cause is the nonlinear acceleration of filament growth as the device is switched from the low-conductance to the high-conductance state; in PCRAM it is the abrupt amorphization in the high-to-low conductance change during the RESET process. Recently, three-terminal floating-gate semiconductor-oxide-nitride-oxide-semiconductor (SONOS) cells, close cousins of NAND flash, have been demonstrated to achieve high training accuracy at low energy, and they may be the best near-term weight storage devices. However, the challenges of SONOS include relatively long write latency (>1 μs), high write voltage, limited scalability, and low endurance. The most novel emerging class of three-terminal devices, nonvolatile redox transistors, shows excellent training behavior at very low write voltages. These three-terminal elements, such as Sandia's Lithium-Ion Synaptic Transistor for Analog Computing (LISTA), utilize the electrochemical mechanism of a lithium-ion battery, in which the cathode conductivity is modulated by lithium- or proton-ion transport (charging and discharging the battery). While very promising, these devices face research challenges in speed, endurance, and CMOS integration.
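The contrast between these device classes can be illustrated with a toy comparison. The sketch below uses assumed, normalized update models (not measured data, and not the CrossSim characterization methodology itself) to contrast a nonlinear, asymmetric ReRAM/PCRAM-like device with a near-ideal linear, symmetric redox-transistor-like device via a simple pulse-trace asymmetry metric:

import numpy as np

def pulse_trace(step_up, step_down, n=100):
    """Normalized conductance trace: n potentiating pulses, then n depressing."""
    g, trace = 0.0, []
    for _ in range(n):
        g = min(g + step_up(g), 1.0)
        trace.append(g)
    for _ in range(n):
        g = max(g - step_down(g), 0.0)
        trace.append(g)
    return np.array(trace)

# Two assumed device models on a normalized [0, 1] conductance range: the
# first has an up step that collapses near full conductance (nonlinearity)
# and differs from its down step (asymmetry); the second has constant,
# matched steps, idealizing the near-linear behavior of redox transistors.
nonlinear = pulse_trace(lambda g: 0.05 * (1.0 - g), lambda g: 0.10 * g)
linear = pulse_trace(lambda g: 0.01, lambda g: 0.01)

def asymmetry(trace, n=100):
    """Mean up/down step mismatch, paired approximately by conductance level.

    Zero for a perfectly linear, symmetric device.
    """
    up = np.diff(trace[:n])    # potentiation steps (positive)
    down = np.diff(trace[n:])  # depression steps (negative)
    return float(np.mean(np.abs(up + down[::-1])))

print(f"ReRAM/PCRAM-like asymmetry:      {asymmetry(nonlinear):.4f}")
print(f"redox-transistor-like asymmetry: {asymmetry(linear):.4f}")

In a characterization methodology like the one described above, update traces of this kind, measured from real devices, can be fed into a training simulation to predict the resulting accuracy loss.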
This presentation will discuss current research directions for key two- and three-terminal analog neural memory devices, with a focus on areas where materials and device innovation can drive significant progress in neural training technology.

SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.

Figure 1