We discuss various strategies for parallelizing streamline simulators and present a single-phase shared memory implementation. The choice of a shared memory programming model is motivated by its suitability for streamline simulation, as well as by the rapid advance of multicore processors, which are readily available at low cost. We show that streamline-based methods are easily parallelizable on shared memory architectures because they decompose the multidimensional transport equations into a large set of independent 1D transport solves. We tested both a specialized explicit load balancing algorithm, which optimizes the distribution of streamlines across threads to minimize thread idle time, and the dynamic load balancing algorithms provided by OpenMP on the shared memory machines. Our results clearly indicate that the built-in schedulers are competitive with specialized load balancing strategies as long as the number of streamlines per thread is sufficiently high, which is the case in field applications. The average workload per thread is then largely insensitive to workload variations between individual streamlines, and any load balancing advantage offered by explicit strategies is not sufficient to overcome the associated computational and parallel overhead. In terms of allocating streamlines or streamline segments to threads, we investigated both the distributed approach, in which threads are assigned streamline segments, and the owner approach, in which threads own complete streamlines. We found the owner approach to be more suitable: the slight load balancing advantage of the distributed approach is not enough to compensate for its additional overheads. Moreover, the owner approach allows straightforward reuse of existing sequential codes, which is not the case for the distributed approach when implicit or adaptive implicit solution strategies are used. The tracing and mapping stages in streamline simulation have low parallel efficiency. However, in real-field models the computational burden of the streamline solves is significantly heavier than that of the tracing and mapping stages, and the impact of these stages is therefore limited. We tested the parallelization on three shared memory systems: a Sun SPARC server with 24 dual-core processors; an eight-way Sun Opteron server, representative of the state-of-the-art shared memory systems in use in the industry; and the recently released Sun Niagara II multicore machine, which has eight floating-point compute units on a single chip. We tested a single-phase flow problem on three heterogeneous reservoirs with varying well placements; this setup represents a worst-case scenario because the tracing and mapping costs are not negligible compared with the transport costs. For the SPARC and Opteron systems, we found parallel efficiencies between 60% and 75% for the tracer flow problems. The sublinear speedup is mostly due to communication overheads in the tracing and mapping stages. In applications with more complex physics, the relative contribution of these stages will decrease significantly, and we expect the parallel performance to be nearly linear. On the Niagara II, we obtained nearly perfect linear scalability even for the single-phase flow problem, thanks to the reduced communication costs of this shared-cache architecture. This result is all the more satisfying given that future server designs are expected to resemble this system.
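To illustrate the owner approach summarized above, the sketch below shows how independent 1D streamline solves might be distributed over threads using OpenMP's dynamic scheduler. This is a minimal sketch written for this summary, not the authors' code: the `Streamline` struct, the `solve_1d_transport` routine (here just one first-order upwind step of the tracer equation in time-of-flight coordinates), and all other names are hypothetical placeholders.

```c
#include <omp.h>      /* OpenMP runtime; compile with -fopenmp or equivalent */

/* Hypothetical per-streamline data: time of flight and the transported
 * quantity (e.g. tracer concentration) at each node along the line. */
typedef struct {
    int     n_nodes;
    double *tau;   /* time of flight at each node */
    double *sat;   /* transported quantity at each node */
} Streamline;

/* Stand-in 1D solve: a single explicit first-order upwind step of the tracer
 * equation ds/dt + ds/dtau = 0 along the streamline (assumes dt does not
 * exceed the smallest dtau for stability). A real solver would be far more
 * elaborate, but it would remain independent from all other streamlines. */
static void solve_1d_transport(Streamline *sl, double dt)
{
    for (int i = sl->n_nodes - 1; i > 0; --i) {
        double dtau = sl->tau[i] - sl->tau[i - 1];
        if (dtau > 0.0)
            sl->sat[i] -= dt / dtau * (sl->sat[i] - sl->sat[i - 1]);
    }
}

/* Owner approach: each thread takes complete streamlines. schedule(dynamic)
 * lets the OpenMP runtime hand whole streamlines to idle threads, which the
 * abstract reports is competitive with an explicit load balancing scheme
 * once each thread has enough streamlines to work on. */
void solve_all_streamlines(Streamline *lines, long n_lines, double dt)
{
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n_lines; ++i)
        solve_1d_transport(&lines[i], dt);
}
```

By contrast, the distributed approach would split each streamline into segments handed to different threads, which requires extra coordination to assemble the per-segment results and is the source of the additional overheads the abstract mentions.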