Graphics Processing Units (GPUs) can be used as convenient hardware accelerators to speed up Cellular Automata (CA) simulations, which are employed in many scientific areas. However, an important class of CA is performance-constrained by GPU memory bandwidth, and few studies have fully explored how such CA implementations can take advantage of modern GPU architectures, particularly under intensive memory usage. In this paper, we present a thorough study of techniques (a stencil computing framework, look-up tables, and packet coding) to implement CA efficiently on the GPU, taking its architecture into account in detail. Exhaustive experiments are performed to validate these implementation techniques on a set of significant memory-bound CA: the classical Game of Life, a Forest Fire model, a Cyclic cellular automaton, and the WireWorld CA. The experimental results show that implementations using the presented techniques significantly outperform a baseline standard GPU implementation, achieving the best performance among all known implementations of these memory-bound CA. Moreover, some of the techniques, such as look-up tables or temporal blocking, are relatively easy to implement or apply when the transition rules are simple. Finally, detailed descriptions and discussions of these techniques are included, which may be useful to practitioners interested in developing efficient, high-performance CA-based simulations on GPUs.
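The abstract does not include code, but a minimal CUDA sketch of the look-up-table technique it names may help illustrate the idea: the full 3x3 Game of Life neighbourhood is packed into a 9-bit index, and the next state is read from a precomputed 512-entry table held in constant memory, replacing the rule evaluation with a single table read. All identifiers (life_step_lut, d_lut, W, H), the toroidal boundary handling, and the memory layout are assumptions for this example, not details taken from the paper.

```cuda
// Illustrative sketch (not the authors' code): one Game of Life step driven
// by a 512-entry look-up table indexed by the packed 3x3 neighbourhood.
#include <cuda_runtime.h>

__constant__ unsigned char d_lut[512];   // next state for each 3x3 pattern

__global__ void life_step_lut(const unsigned char *in, unsigned char *out,
                              int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    // Pack the 3x3 neighbourhood (toroidal boundary assumed) into 9 bits.
    unsigned int idx = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = (x + dx + W) % W;
            int ny = (y + dy + H) % H;
            idx = (idx << 1) | in[ny * W + nx];
        }

    // One constant-memory read replaces the rule evaluation.
    out[y * W + x] = d_lut[idx];
}

// Host-side construction of the table: for each 9-bit pattern, apply the
// standard B3/S23 rule. With the packing above, the centre cell is bit 4.
void build_life_lut(unsigned char lut[512])
{
    for (int p = 0; p < 512; ++p) {
        int centre = (p >> 4) & 1;
        int alive = 0;
        for (int b = 0; b < 9; ++b) alive += (p >> b) & 1;
        alive -= centre;  // live neighbours only
        lut[p] = (alive == 3 || (centre && alive == 2)) ? 1 : 0;
    }
    // Copy to constant memory before launching the kernel, e.g.:
    // cudaMemcpyToSymbol(d_lut, lut, 512);
}
```

The same pattern generalises to the other CA mentioned (Forest Fire, Cyclic, WireWorld) whenever the number of distinct neighbourhood configurations is small enough for the table to fit in constant or shared memory.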