The Transformer architecture has attracted considerable attention in recent years, in part because it scales well and allows massive parallelism during training. This has enabled the development of Language Models (LMs) of increasing size and the discovery of latent abilities that far outclass traditional methods, e.g. rule-based systems. However, these models also introduce new issues, such as their inability to retain the history of previous interactions, owing to their stateless nature, and the difficulty of controlling their generation. Several attempts have been made to address these issues; for example, a `brute force' approach to the memory issue is to include the full conversation history in the context window, a solution limited by the quadratic cost of Transformer self-attention with respect to sequence length. In this work, we explore computationally practical solutions to the memory problem. We propose to augment the decoder-only architecture of (most) Large LMs with a (relatively small) memory encoder, whose output is prepended to the decoder's input, in a fashion similar to recent work on Adapters and to the original encoder-decoder Transformer. Initial experiments show promising results; however, future work is needed to compare against state-of-the-art methods.
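To make the proposed idea concrete, the sketch below illustrates one way a small memory encoder could summarize past interactions and have its output prepended to the decoder's input embeddings. All module names, sizes, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed, not the paper's code): a small encoder turns the
# conversation history into memory embeddings that are prepended to the
# decoder's input sequence, prefix-style.
import torch
import torch.nn as nn

class MemoryAugmentedDecoderInput(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_mem_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Relatively small memory encoder (hypothetical configuration).
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.memory_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_mem_layers)

    def forward(self, history_ids, current_ids):
        # Encode the conversation history into memory vectors.
        memory = self.memory_encoder(self.embed(history_ids))   # (B, M, d_model)
        current = self.embed(current_ids)                        # (B, T, d_model)
        # Prepend the memory to the decoder's input embeddings.
        return torch.cat([memory, current], dim=1)               # (B, M+T, d_model)

# Usage: the concatenated sequence would then be fed to a decoder-only LM
# in place of its ordinary input embeddings.
history = torch.randint(0, 32000, (1, 64))   # past turns (token ids)
current = torch.randint(0, 32000, (1, 16))   # current turn (token ids)
decoder_input = MemoryAugmentedDecoderInput()(history, current)
print(decoder_input.shape)  # torch.Size([1, 80, 512])
```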