Thousands Of In-flight Instructions Research Articles

Memory latency-tolerant architectures support thousands of in-flight instructions without scaling cycle-critical processor resources, and thousands of useful instructions can complete in parallel with a miss to memory. These architectures, however, require large queues to track all loads and stores executed while a miss is pending. Hierarchical designs alleviate the cycle-time impact of these structures, but the CAM and search functions required to enforce memory ordering and provide data forwarding place high demands on area and power. We present new load-store processing algorithms for latency-tolerant architectures. We augment the primary load and store queues with secondary buffers. The secondary load buffer is a set-associative structure, similar to a cache. The secondary store buffer, the Store Redo Log (SRL), is a first-in first-out structure that records the program order of all stores completed in parallel with a miss, and it has no CAM or search functions. Instead of the secondary store queue, a cache provides temporary forwarding. The SRL enforces memory ordering by ensuring that memory updates occur in program order once the miss returns. The new algorithms eliminate the CAM and search functions in the secondary load and store buffers, and remove fundamental sources of complexity, power, and area inefficiency in load/store processing. The new organization, while being area and power efficient, is competitive in performance with hierarchical designs.
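The Store Redo Log idea in the abstract above can be illustrated with a minimal software sketch. This is a toy model under stated assumptions, not the hardware design from the paper: the names `StoreRedoLog`, `forward_cache`, and `miss_returned` are hypothetical, memory is modeled as a Python dict, the log is unbounded, and unwritten addresses read as 0. It shows only the two properties the abstract names: stores are recorded in a FIFO with no associative search, and a separate cache-like structure supplies forwarding until the log is drained in program order.

```python
from collections import deque


class StoreRedoLog:
    """Toy model of a FIFO store log plus a forwarding cache (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory        # assumption: dict mapping address -> value
        self.log = deque()          # FIFO of (address, value), in program order
        self.forward_cache = {}     # address -> most recent logged value

    def store(self, address, value):
        # Append to the FIFO; the log itself is never searched by address.
        self.log.append((address, value))
        # The cache-like structure holds the latest value for forwarding.
        self.forward_cache[address] = value

    def load(self, address):
        # Forward from the temporary cache if a logged store wrote this address.
        if address in self.forward_cache:
            return self.forward_cache[address]
        return self.memory.get(address, 0)

    def miss_returned(self):
        # Drain the log front to back so memory is updated in program order.
        drained = 0
        while self.log:
            address, value = self.log.popleft()
            self.memory[address] = value
            drained += 1
        self.forward_cache.clear()
        return drained


memory = {0x10: 1}
srl = StoreRedoLog(memory)
srl.store(0x10, 5)
srl.store(0x20, 7)
srl.store(0x10, 9)
forwarded = srl.load(0x10)   # 9, served by the forwarding cache
before = memory[0x10]        # still 1, memory is untouched while the miss is pending
srl.miss_returned()          # memory becomes {0x10: 9, 0x20: 7}
```

Keeping forwarding in a separate indexed structure is what lets the log stay a plain queue in this sketch; the real design also has to handle load ordering checks and cache capacity, which are omitted here.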


The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints. In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. All together, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.
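One way to picture checkpoint-based commit without a reorder buffer is the toy model below. It is a sketch under assumptions, not the mechanism as specified in the paper: the classes `Checkpoint` and `CheckpointedCore`, the per-checkpoint `pending` counter, and the dict-based register snapshot are all hypothetical. Each instruction is associated with the youngest checkpoint at dispatch, instructions may finish in any order, and the oldest checkpoint is released once all of its instructions are done and a younger checkpoint exists.

```python
class Checkpoint:
    """Snapshot of register state plus a count of unfinished instructions."""

    def __init__(self, registers):
        self.registers = dict(registers)  # assumption: registers modeled as a dict
        self.pending = 0                  # instructions dispatched under this checkpoint


class CheckpointedCore:
    """Toy out-of-order commit model (hypothetical, no reorder buffer)."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.checkpoints = [Checkpoint(self.registers)]  # oldest first

    def take_checkpoint(self):
        self.checkpoints.append(Checkpoint(self.registers))

    def dispatch(self):
        # Associate the new instruction with the youngest checkpoint.
        cp = self.checkpoints[-1]
        cp.pending += 1
        return cp

    def complete(self, cp):
        # Instructions may complete in any order; only the counter matters.
        cp.pending -= 1
        released = 0
        # Release the oldest checkpoint once its group has fully completed.
        while len(self.checkpoints) > 1 and self.checkpoints[0].pending == 0:
            self.checkpoints.pop(0)
            released += 1
        return released

    def rollback(self, cp):
        # On an exception or misprediction, restore the snapshot and
        # discard every younger checkpoint.
        index = self.checkpoints.index(cp)
        del self.checkpoints[index + 1:]
        cp.pending = 0
        self.registers = dict(cp.registers)


core = CheckpointedCore({"r1": 0})
a = core.dispatch()            # two instructions under the initial checkpoint
b = core.dispatch()
core.take_checkpoint()
c = core.dispatch()            # one instruction under the second checkpoint
r1 = core.complete(b)          # 0 released, one older instruction still pending
r2 = core.complete(c)          # 0 released, the oldest group is not finished
r3 = core.complete(a)          # 1 released, the oldest group is now complete
```

The sketch commits at checkpoint granularity, so the order in which `a`, `b`, and `c` finish does not matter; the cost is that a rollback discards all work since the restored checkpoint rather than since the faulting instruction.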


Articles published on Thousands Of In-flight Instructions

Scalable Load and Store Processing in Latency-Tolerant Processors

Scalable Load and Store Processing in Latency Tolerant Processors

Toward kilo-instruction processors

Scalable hardware memory disambiguation for high-ILP processors

Future ILP processors
