Due to its flexibility, compute mode is becoming more and more attractive as a way to implement many of the algorithms part of a state-of-the-art rendering pipeline. A key problem commonly encountered in graphics applications is streaming vertex and geometry processing. In a typical triangle mesh, the same vertex is on average referenced six times. To avoid redundant computation during rendering, a post-transform cache is traditionally employed to reuse vertex processing results. However, such a vertex cache can generally not be implemented efficiently in software and does not scale well as parallelism increases. We explore alternative strategies for reusing per-vertex results on-the-fly during massively-parallel software geometry processing. Given an input stream divided into batches, we analyze the effectiveness of sorting, hashing, and intra-thread-group communication for identifying and exploiting local reuse potential. We design and present four vertex reuse strategies tailored to modern GPU architectures. We demonstrate that, in a variety of applications, these strategies not only achieve effective reuse of vertex processing results, but can boost performance by up to 2-3x compared to a naïve approach. Curiously, our experiments also show that our batch-based approaches exhibit behavior similar to the OpenGL implementation on current graphics hardware.
Read full abstract