Abstract

The performance of lattice–Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance therefore requires complex code that carefully handles data placement and the ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach, a data-oriented language, which allows complex memory access patterns to be combined with simple source code. As a use case, we present, and provide the source code of, a solver for the D2Q9 lattice and show its performance on a GTX Titan Xp GPU for dense and sparse geometries of up to 4096² nodes. The obtained results are promising: around 1000 lines of code allowed us to achieve 0.6 to 0.7 of the maximum theoretical memory bandwidth (over 2.5 and 5.0 GLUPS for double and single precision, respectively) for meshes larger than 1024² nodes, which is close to the current state of the art. However, we also observed relatively high and sometimes hard-to-predict overheads, especially for sparse data structures. Additional issues were a rather long compilation time, which lengthened short simulations, and the lack of access to low-level optimisation mechanisms.

Highlights

  • Current high-performance computers use some form of parallel processing at many levels: from instruction-level parallelism (ILP) and single-instruction, multiple-data (SIMD) support, through dynamic random access memories (DRAM), which transfer data in blocks of several dozen bytes, up to multi-/many-core chips and clusters of machines connected by a fast network

  • The lattice–Boltzmann method (LBM) is a computational fluid dynamics (CFD) algorithm based on the cellular-automaton idea, where automaton cells correspond to points of a uniformly discretised computational domain

  • We present an implementation of the lattice–Boltzmann method in the Taichi language for both dense and sparse geometries and investigate its performance on a massively parallel graphics processing unit (GPU)
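To make the method concrete, the following is a minimal NumPy sketch of one D2Q9 BGK stream-and-collide step on a dense periodic grid. It is not the paper's Taichi implementation; all names (`equilibrium`, `stream_collide`, the relaxation time `tau`) are illustrative, and only the standard D2Q9 velocities and weights are taken as given.

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their standard weights.
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, ux, uy):
    # Second-order BGK equilibrium distribution for each velocity.
    cu = c[:, 0, None, None]*ux + c[:, 1, None, None]*uy
    usq = ux*ux + uy*uy
    return w[:, None, None]*rho*(1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def stream_collide(f, tau=0.6):
    # One LBM step: compute moments, BGK collision, then streaming
    # (periodic boundaries via np.roll; tau is an illustrative value).
    rho = f.sum(axis=0)
    ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
    f = f + (equilibrium(rho, ux, uy) - f) / tau
    for i in range(9):
        f[i] = np.roll(np.roll(f[i], c[i, 0], axis=0), c[i, 1], axis=1)
    return f

# Sanity check: a uniform fluid at rest is a fixed point of the update.
nx = ny = 32
f = equilibrium(np.ones((nx, ny)), np.zeros((nx, ny)), np.zeros((nx, ny)))
f = stream_collide(f)
```

The per-step loop over nine `np.roll` calls is exactly the kind of memory-access-heavy pattern whose layout a data-oriented language handles for the programmer.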


Summary

Introduction

When neighbouring elements require different operations, the hardware is usually significantly underutilised. Because of these limitations, many computational problems, for example in physics simulations, require sophisticated algorithms and non-trivial in-memory data layouts to achieve high performance. Typical examples of such problems are simulations on sparse geometries, i.e., geometries for which computations must be performed on only a small part of the area/volume.
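The dense-versus-sparse distinction can be sketched in a few lines: a sparse (indirect) layout stores only the active fluid nodes plus an index map, trading indirect addressing for memory savings. The geometry, the 10% fluid fraction, and all variable names below are illustrative assumptions, not details of the paper's solver.

```python
import numpy as np

# Hypothetical 64x64 geometry where only ~10% of the nodes are fluid.
nx = ny = 64
rng = np.random.default_rng(0)
solid = rng.random((nx, ny)) > 0.1   # True = solid node, skipped in computation

# Dense layout: store distributions for every node, mask out solids each step.
# Sparse (indirect) layout: store only fluid nodes plus their flat indices.
fluid_idx = np.flatnonzero(~solid.ravel())   # flat indices of fluid nodes
f_sparse = np.zeros((9, fluid_idx.size))     # 9 distributions per fluid node

# Approximate memory footprint in double precision (8 bytes per value):
dense_bytes = 9 * nx * ny * 8
sparse_bytes = 9 * fluid_idx.size * 8 + fluid_idx.nbytes
```

The index map makes neighbour lookups irregular, which is precisely why sparse geometries stress memory access patterns and why their overheads were harder to predict in the reported measurements.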

