The stream fetch engine is a high-performance fetch architecture based on the concept of the instruction stream. We define a stream as the sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks. The length of instruction streams allows the stream fetch engine to provide high fetch bandwidth and to hide the branch predictor access latency, yielding performance close to that of a trace cache at lower implementation cost and complexity. Enlarging instruction streams is therefore a natural way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms that enlarge the streams ending at particular branch types. However, our results show that focusing on particular branch types is not a good strategy because of Amdahl's law. Consequently, we propose the multiple stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without the complexity of additional hardware mechanisms such as prediction overriding. Moreover, it delivers performance comparable to state-of-the-art fetch architectures with a simpler design that consumes less energy.
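The stream definition above can be illustrated with a minimal sketch that splits a dynamic instruction trace into streams. The trace format, a list of `(pc, taken)` pairs, and the function name `split_into_streams` are assumptions made for illustration only; they do not come from the paper, and the sketch does not model the predictor or the fetch engine itself.

```python
from typing import List, Tuple

def split_into_streams(trace: List[Tuple[int, bool]]) -> List[Tuple[int, int]]:
    """Split a dynamic trace into (start_pc, length) streams.

    Assumed trace format (hypothetical): each entry is (pc, taken), where
    `taken` is True only for an instruction that is a taken branch.
    Not-taken branches carry taken=False and stay inside the current
    stream, which is why one stream can span several basic blocks.
    """
    streams = []
    start_pc = 0
    length = 0
    for pc, taken in trace:
        if length == 0:
            start_pc = pc          # first instruction of a new stream
        length += 1
        if taken:                  # a taken branch closes the stream
            streams.append((start_pc, length))
            length = 0
    if length:                     # trailing partial stream, if any
        streams.append((start_pc, length))
    return streams

trace = [
    (0x100, False),
    (0x104, False),  # e.g. a not-taken branch: the stream continues
    (0x108, True),   # taken branch: ends the first stream
    (0x200, False),
    (0x204, True),   # taken branch: ends the second stream
    (0x100, False),  # start of a third, unfinished stream
]
streams = split_into_streams(trace)
```

Under these assumptions, each stream is fully identified by its starting address and its length, so longer streams mean more instructions covered per identified stream.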