Abstract

Memory latency-tolerant architectures support thousands of in-flight instructions without proportionate scaling of cycle-critical processor resources, allowing thousands of useful instructions to complete in parallel with a long-latency miss to memory. These architectures, however, require large queues to track all loads and stores executed while a long-latency miss is pending. Hierarchical designs alleviate the cycle-time impact of these structures, but the Content-Addressable-Memory (CAM) and search functions required to enforce memory ordering and provide data forwarding place heavy demands on area and power. Many recent proposals address the complexity of load and store queues. However, none of these proposals addresses the fundamental source of complexity in these queues: the constant searching required to enforce ordering among memory operations and to provide proper data forwarding. These earlier proposals only offer mechanisms for coping with search complexity. This dissertation presents a novel proposal for high-performance load and store queues that do not require fully-associative searches. We present new load and store processing mechanisms for latency-tolerant architectures. We augment small, primary load and store queues with large, secondary buffers. The secondary load buffer is an unordered, set-associative structure, similar to a cache. The secondary store buffer, the Store Redo Log (SRL), is a first-in first-out structure recording the program order of all stores completed in parallel with a miss, and has no CAM or search functions. A cache, rather than a secondary store queue, provides temporary forwarding. The SRL enforces memory ordering by ensuring that memory updates occur in program order once the miss returns. The new mechanisms eliminate the CAM and search functions in the secondary load and store buffers, removing fundamental sources of complexity, power, and area inefficiency in load and store processing. The new organization is area- and power-efficient while remaining competitive in performance with hierarchical designs. The key idea behind our proposal is: "Redoing certain stores to fix dependences is better than trying to constantly enforce dependences." The design of both load and store queues is inherently scalable and significantly simpler because it lacks any CAM logic. Our method shows 5x area and 6x total power savings over hierarchical designs.
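To make the SRL idea concrete, the following is a minimal sketch, not the dissertation's actual design: it assumes a plain dictionary stands in for the cache that supplies temporary forwarding, and models the SRL as a simple FIFO of stores completed under a pending miss that is replayed in program order once the miss returns. All names (StoreRedoLog, record, redo) are illustrative, not taken from the dissertation.

    # Minimal sketch of the Store Redo Log (SRL) concept, under simplifying
    # assumptions: stores completing in the shadow of a miss are logged in a
    # FIFO (no CAM, no associative search) and redone in program order later.
    from collections import deque

    class StoreRedoLog:
        """FIFO of (address, value) pairs recorded in program order."""
        def __init__(self):
            self.fifo = deque()

        def record(self, addr, value):
            # Append each completed store in program order; never searched.
            self.fifo.append((addr, value))

        def redo(self, memory):
            # When the miss returns, replay every logged store in program
            # order, restoring correct memory ordering without any search.
            while self.fifo:
                addr, value = self.fifo.popleft()
                memory[addr] = value

    # Usage: the cache (here a plain dict) temporarily forwards store data to
    # younger loads; the SRL guarantees the memory image is updated in order.
    memory = {}
    cache = {}
    srl = StoreRedoLog()

    for addr, value in [(0x10, 1), (0x20, 2), (0x10, 3)]:  # stores past the miss
        cache[addr] = value        # speculative forwarding source
        srl.record(addr, value)    # logged for in-order redo

    srl.redo(memory)               # miss returns: redo stores in program order
    assert memory == {0x10: 3, 0x20: 2}

Because the log is only ever appended to and then drained in order, it needs no CAM or search logic, which is the source of the area and power savings the abstract reports.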
