This paper presents a study of macro data load, a novel mechanism for increasing the amount of loaded data reuse within a processor. A macro data load brings into the processor the maximum-width data the cache port allows. In a 64-bit processor, for example, a byte load will bring a full 64-bit word from the cache and save it in an internal hardware structure, while using only the specified byte of that word for itself. The saved data can be opportunistically reused by later loads internally, reducing the number of relatively expensive cache accesses. We present a comprehensive availability study using a generalized memory data reuse table (MDRT) to quantify the memory data reuse opportunities available in a set of benchmark programs drawn from the SPEC2k and MiBench suites, and to demonstrate the efficacy of the proposed scheme. The macro data load mechanism is shown to open up significantly more loaded data reuse opportunities than previous schemes, which have no support for spatial locality. We observe 15.1 percent (SPEC2k integer), 20.9 percent (SPEC2k floating-point), and 45.8 percent (MiBench) more load-to-load forwarding instances when a 256-entry MDRT is used. We also describe a modified load-store queue design as a possible implementation of the proposed concept. Our quantitative study using a realistic processor model shows that 21.3 percent, 14.8 percent, and 23.6 percent of L1 cache accesses in the SPEC2k integer, floating-point, and MiBench programs can be eliminated, resulting in a related energy reduction of 11.4 percent, 9.0 percent, and 14.3 percent on average, respectively.
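To make the mechanism concrete, the following is a minimal behavioral sketch in C of how a macro data load might be modeled. It is not the paper's load-store-queue implementation: the fully associative lookup, FIFO replacement, and the names macro_load and cache_read64 are all illustrative assumptions. The sketch only captures the core idea, namely fetching the full 64-bit word on every load and forwarding later loads from a 256-entry MDRT-like table instead of accessing the cache.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MDRT_ENTRIES 256   /* matches the 256-entry table studied in the paper */

/* One MDRT entry: an 8-byte-aligned address and the full word loaded there. */
typedef struct {
    bool     valid;
    uint64_t aligned_addr;   /* addr & ~7ULL */
    uint64_t word;           /* full 64-bit data captured by the macro load */
} mdrt_entry_t;

static mdrt_entry_t mdrt[MDRT_ENTRIES];
static unsigned     next_victim;       /* simple FIFO replacement (an assumption) */

/* Stand-in for an L1 data-cache access; counts the accesses we could not avoid. */
static unsigned long cache_accesses;
static uint64_t cache_read64(uint64_t aligned_addr) {
    cache_accesses++;
    return aligned_addr * 0x9E3779B97F4A7C15ULL;   /* dummy memory contents */
}

/* Macro data load: always fetch the full 64-bit word containing the requested
 * bytes, reuse it from the MDRT when possible, and return only the `size`-byte
 * field the instruction actually asked for. Loads crossing an 8-byte boundary
 * are not handled in this sketch. */
uint64_t macro_load(uint64_t addr, unsigned size) {
    uint64_t aligned = addr & ~7ULL;
    uint64_t word = 0;
    bool hit = false;

    for (unsigned i = 0; i < MDRT_ENTRIES; i++) {
        if (mdrt[i].valid && mdrt[i].aligned_addr == aligned) {
            word = mdrt[i].word;       /* load-to-load forwarding: no cache access */
            hit = true;
            break;
        }
    }
    if (!hit) {
        word = cache_read64(aligned);  /* one maximum-width cache access */
        mdrt[next_victim] = (mdrt_entry_t){ true, aligned, word };
        next_victim = (next_victim + 1) % MDRT_ENTRIES;
    }

    /* Extract the requested bytes (little-endian layout assumed). */
    unsigned shift = (unsigned)(addr & 7ULL) * 8;
    uint64_t mask  = (size == 8) ? ~0ULL : ((1ULL << (size * 8)) - 1);
    return (word >> shift) & mask;
}

int main(void) {
    macro_load(0x1000, 1);   /* byte load: misses, fetches full word 0x1000-0x1007 */
    macro_load(0x1004, 4);   /* later load to the same 8-byte block: hits the MDRT */
    printf("cache accesses: %lu\n", cache_accesses);   /* prints 1, not 2 */
    return 0;
}
```

The second load in main illustrates the spatial-locality reuse that a conventional load-to-load forwarding scheme, which records only the bytes a load actually consumed, would miss.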