Abstract

The trace cache is a technique that provides accurate, high-bandwidth instruction fetch. However, when a desired instruction trace is not found in the cache, conventional instruction fetch and decode must be used to satisfy the trace request. Such auxiliary fetch hardware can be expensive in terms of energy, area, and complexity. This work explores an approach that combines a trace cache and conventional instruction fetch hardware in a decoupled design. The design enables the processor to dynamically switch between trace-ID-based and PC-based prediction methods and helps to hide the latency of the instruction memory path. The decoupled design with accelerated slow-path instruction delivery and no instruction cache provides benefit comparable to a front-end with an 8 kB instruction cache (within 2% of the instructions per cycle achieved with the cache). The design also shows high tolerance to both trace table misses and increased memory latency when the trace table size is scaled down and the L2 access latency is scaled up.
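To make the fast-path/slow-path switch concrete, the sketch below models a front-end that first probes a trace table by trace ID and falls back to a PC-based fetch path on a miss. This is an illustrative sketch only, not the paper's implementation: the table size, the trace-ID construction from the start PC and branch history, and all function and field names are assumptions chosen for clarity.

```c
/*
 * Toy model of a front-end that probes a trace table by trace ID and falls
 * back to PC-based fetch on a miss. Illustrative only; sizes, hashing, and
 * names are assumptions, not the paper's design.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TRACE_TABLE_ENTRIES 256   /* assumed size; the paper studies scaling this down */

typedef struct {
    bool     valid;
    uint64_t trace_id;            /* derived from start PC + branch-outcome history */
    uint64_t start_pc;
} trace_entry_t;

static trace_entry_t trace_table[TRACE_TABLE_ENTRIES];

/* Build a trace ID from the starting PC and recent branch outcomes (assumed encoding). */
static uint64_t make_trace_id(uint64_t start_pc, uint32_t branch_history)
{
    return (start_pc << 16) ^ branch_history;
}

/* Fast path: look up a trace by trace ID. */
static bool trace_table_lookup(uint64_t trace_id, uint64_t *start_pc)
{
    trace_entry_t *e = &trace_table[trace_id % TRACE_TABLE_ENTRIES];
    if (e->valid && e->trace_id == trace_id) {
        *start_pc = e->start_pc;
        return true;
    }
    return false;
}

/* Slow path: conventional PC-based fetch via the instruction memory path. */
static uint64_t pc_based_fetch(uint64_t pc)
{
    /* A real front-end would access instruction memory and decode here. */
    return pc;
}

int main(void)
{
    uint64_t pc = 0x400000;
    uint32_t history = 0x5;       /* taken/not-taken bits, assumed encoding */
    uint64_t trace_id = make_trace_id(pc, history);
    uint64_t fetch_pc;

    if (trace_table_lookup(trace_id, &fetch_pc))
        printf("trace hit: fetch from 0x%llx\n", (unsigned long long)fetch_pc);
    else
        printf("trace miss: PC path fetches 0x%llx\n",
               (unsigned long long)pc_based_fetch(pc));
    return 0;
}
```

In a decoupled organization the hit/miss decision only produces fetch requests, so the slow PC-based path can overlap its memory latency with work already in flight; this is consistent with the abstract's claim that the design hides instruction memory latency and tolerates a smaller trace table and a longer L2 access latency.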
