Abstract

Graphics processing units (GPUs) have been increasingly used to accelerate general-purpose computations. By exploiting massive thread-level parallelism (TLP), GPUs can achieve high throughput and hide memory latency. As a result, a very large register file (RF) is typically required to enable fast, low-cost context switching among tens of thousands of active threads. However, RF capacity is often still insufficient to support all available thread-level parallelism, and the shortage of RF resources can hurt performance by limiting the occupancy of GPU threads. Moreover, if the available RF capacity cannot satisfy the register requirements of a thread block, the GPU must spill some variables to local memory, which can incur long memory access latencies. Observing that, for many GPGPU applications, a large percentage of computed results have fewer significant bits than the full width of a 32-bit register, we propose a GPU register packing scheme that dynamically exploits narrow-width operands and packs multiple operands into a single full-width register. By packing registers dynamically, more RF space becomes available, which allows the GPU to exploit more TLP by assigning additional thread blocks to SMs (streaming multiprocessors) and thus improves performance. The experimental results show that our GPU register packing scheme achieves a speedup of up to 1.96X, and 1.18X on average.
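As a conceptual illustration only (the paper's scheme is a hardware mechanism inside the register file, not source-level code), the sketch below shows how two signed values that each fit within 16 significant bits could share a single 32-bit register word. The helper names pack_narrow, unpack_narrow, and is_narrow16, and the fixed 16/16 split, are hypothetical and chosen purely to make the packing idea concrete.

```cuda
#include <cstdint>
#include <cstdio>

// Pack two signed values that each fit in 16 significant bits
// ("narrow-width" operands) into one 32-bit register word.
__host__ __device__ inline uint32_t pack_narrow(int32_t a, int32_t b) {
    return (static_cast<uint32_t>(a) & 0xFFFFu) |
           (static_cast<uint32_t>(b) << 16);
}

// Recover the two operands, sign-extending each 16-bit half.
__host__ __device__ inline void unpack_narrow(uint32_t packed,
                                              int32_t *a, int32_t *b) {
    *a = static_cast<int32_t>(static_cast<int16_t>(packed & 0xFFFFu));
    *b = static_cast<int32_t>(static_cast<int16_t>(packed >> 16));
}

// A value is "narrow" here if it is representable in 16 bits,
// i.e., its upper bits are all copies of the sign bit.
__host__ __device__ inline bool is_narrow16(int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}

int main() {
    int32_t a = -1234, b = 777;
    if (is_narrow16(a) && is_narrow16(b)) {
        uint32_t r = pack_narrow(a, b);  // one 32-bit word holds both operands
        int32_t x, y;
        unpack_narrow(r, &x, &y);
        printf("packed=0x%08x  unpacked=(%d, %d)\n", r, x, y);
    }
    return 0;
}
```

In the proposed scheme, this kind of packing decision is made dynamically by the hardware based on the significant-bit width of computed results, rather than by the programmer or compiler as in this sketch.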

