Increasing the number of processing cores is currently a common way to boost processor performance. However, the load on the memory subsystem grows accordingly as the number of its agents increases. Hardware data compression is an unconventional approach to improving memory subsystem performance: it reduces the main memory access rate by increasing the effective cache capacity, and it reduces data traffic by packing data more densely. The paper describes the implementation of hardware data compression in the on-chip network and interprocessor links of a configuration with wide data transmission channels and a wormhole flow control policy. Existing solutions cannot be applied to such configurations because they rely on narrow data channels and on flow control policies that guarantee uninterrupted packet transmission, a guarantee that wormhole flow control does not provide. The method proposed in this paper enables the use of hardware compression in this configuration by moving data compression and decompression from the networks to the connected devices, and by applying a number of optimizations to hide the data processing delays. Optimizations for specific cases are also considered, such as the transmission of large packets containing several cache lines and the transmission of zero data. Special attention is given to data transmission via interprocessor links, where data compression can be most beneficial due to their lower bandwidth compared to the on-chip network. The increase in memory subsystem bandwidth from hardware data compression was confirmed experimentally, with a relative IPC increase of up to 14 percent on SPEC CPU2017 benchmarks.
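The abstract does not include code, but the zero-data special case it mentions can be illustrated with a minimal sketch. Everything below (the packet header layout, the `HDR_FLAG_ZERO_LINE` flag, the 64-byte line size, and the function names) is a hypothetical illustration, not the authors' implementation: the sending device checks whether a cache line is all zeros and, if so, transmits only a short header instead of the full payload, while the receiver regenerates the zeros.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parameters for illustration only; the actual flit and
 * cache-line formats are not specified in the abstract. */
#define CACHE_LINE_BYTES 64

/* Header flag indicating that the payload flits were elided because
 * the whole cache line is zero. */
#define HDR_FLAG_ZERO_LINE 0x1u

typedef struct {
    uint32_t flags;       /* HDR_FLAG_ZERO_LINE if the payload is elided */
    uint32_t payload_len; /* number of payload bytes that follow */
} packet_header_t;

/* Returns true if the cache line consists entirely of zero bytes. */
static bool line_is_zero(const uint8_t *line) {
    for (size_t i = 0; i < CACHE_LINE_BYTES; i++) {
        if (line[i] != 0) return false;
    }
    return true;
}

/* Sender side: build a header and decide how many payload bytes to send.
 * A zero line is transmitted as a header-only packet. */
static size_t pack_line(const uint8_t *line, packet_header_t *hdr,
                        uint8_t *payload) {
    if (line_is_zero(line)) {
        hdr->flags = HDR_FLAG_ZERO_LINE;
        hdr->payload_len = 0;
        return 0;                       /* no payload flits needed */
    }
    hdr->flags = 0;
    hdr->payload_len = CACHE_LINE_BYTES;
    memcpy(payload, line, CACHE_LINE_BYTES);
    return CACHE_LINE_BYTES;
}

/* Receiver side: reconstruct the cache line from header and payload. */
static void unpack_line(const packet_header_t *hdr, const uint8_t *payload,
                        uint8_t *line) {
    if (hdr->flags & HDR_FLAG_ZERO_LINE) {
        memset(line, 0, CACHE_LINE_BYTES);  /* regenerate the elided zeros */
    } else {
        memcpy(line, payload, hdr->payload_len);
    }
}
```

In actual hardware the zero check would be a wide NOR over the line rather than a byte loop; the sketch only shows the protocol-level effect, namely that a full cache-line payload collapses to a single header.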