Attention plays a fundamental role in both natural and artificial intelligence systems. In deep learning, attention-based neural architectures, such as transformers, are widely used to tackle problems in natural language processing and beyond. Here we investigate the most fundamental building blocks of attention and their computational properties within the standard model of deep learning. We first derive a systematic taxonomy of all possible attention mechanisms within, or as extensions of, the standard model, partitioning them into 18 classes depending on the origin of the attention signal, the target of the attention signal, and whether the interaction is additive or multiplicative. Second, using this taxonomy, we identify three key attention mechanisms: additive activation attention (multiplexing), multiplicative output attention (output gating), and multiplicative synaptic attention (synaptic gating). Output gating and synaptic gating are proper extensions of the standard model, and all current attention-based architectures, including transformers, use either output gating or synaptic gating, or a combination of both. Third, we develop a theory of attention capacity and derive mathematical results about the capacity of basic attention networks comprising linear or polynomial threshold gates. For example, the output gating of a linear threshold gate of n variables by another linear threshold gate of the same n variables has capacity $2n^2(1+o(1))$, achieving the maximal doubling of the capacity for a doubling of the number of parameters. Perhaps surprisingly, multiplexing attention is used in the proofs of these results. Synaptic and output gating provide computationally efficient extensions of the standard model, enabling sparse quadratic activation functions. They can also be viewed as primitives for collapsing several layers of processing in the standard model into shallow, compact representations.
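To make output gating and synaptic gating concrete, the following is a minimal NumPy sketch, not the construction used in the paper: the function names and the choice of a single scalar attention gate applied to all synapses are illustrative assumptions. It shows one linear threshold gate attending to another over the same n inputs, either by multiplying its output (output gating) or by scaling its weights before thresholding (synaptic gating).

```python
import numpy as np

def linear_threshold_gate(w, b, x):
    # Boolean linear threshold gate: 1 if w.x + b >= 0, else 0.
    return float(np.dot(w, x) + b >= 0)

def output_gating(w_att, b_att, w, b, x):
    # Multiplicative output attention: the attending gate's output
    # multiplies the output of the attended gate on the same input x.
    return linear_threshold_gate(w_att, b_att, x) * linear_threshold_gate(w, b, x)

def synaptic_gating(w_att, b_att, w, b, x):
    # Multiplicative synaptic attention: the attending gate's output
    # scales the synaptic weights of the attended gate before it fires.
    # (A single scalar gate for all synapses is an illustrative simplification.)
    a = linear_threshold_gate(w_att, b_att, x)
    return linear_threshold_gate(a * w, b, x)

# Toy usage on n = 8 random inputs.
rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
w_att, w = rng.standard_normal(n), rng.standard_normal(n)
print(output_gating(w_att, 0.0, w, 0.0, x))
print(synaptic_gating(w_att, 0.0, w, 0.0, x))
```

Note that output gating computes a product of two threshold functions of the same inputs, which is one way to see how gating introduces the sparse quadratic interactions mentioned above while only doubling the parameter count.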