Preconditioners based on domain decomposition appear natural for the Krylov solution of implicitly discretized partial differential equations (PDEs) on parallel computers. Two-scale preconditioners (involving a global coarse-grid solve, independent solves over interfaces connecting the coarse-grid points, and independent subdomain solves) have been known since the early 1980s to be “near optimal” in the sense of ensuring a bounded, or at most logarithmically growing, iteration count as the mesh is refined. As a result, the refinement of the mesh can be chosen locally on the basis of truncation error, and the granularity of the domain decomposition can be chosen globally on the basis of parallel computing considerations with only mild effects on the convergence rate of the algorithm. However, overall computational complexity depends not only on the algebraic convergence rate, but also on the operation counts of the components of the preconditioner that must be applied at each iteration. The costs of solving the subdomain systems and the crosspoint system show superlinear growth in their respective (and inversely related) sizes. On the subdomains, the superlinear terms arise from arithmetic only; in the crosspoint system the cost of nonlocal data exchange is also superlinear. These factors make the preconditioner granularity and the choice of its components problem- and machine-dependent compromises. The tradeoffs involved are illustrated through numerical experiments on both shared- and distributed-memory computers for convection-diffusion problems. Because of the development of boundary layers, these problems benefit from local mesh refinement, which is straightforward to accommodate within the domain decomposition framework in a locally uniform sense, but which introduces load balancing as a further consideration in selecting the granularity of the preconditioner. 
In spite of the tradeoffs, cumulative speedups are obtainable out to at least medium-scale granularity (up to 64 processors in our tests). The largest problems involve $\mathcal{O}(10^5)$ unknowns partitioned into $\mathcal{O}(10^3)$ subdomains and converge in $\mathcal{O}(10)$ iterations requiring $\mathcal{O}(1)$ seconds on the Intel iPSC/860.
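
The granularity tradeoff described above can be illustrated with a toy cost model. The sketch below is not the cost model of the paper: the power-law form, the exponents `a` and `b`, the weight `coarse_weight`, and both function names are hypothetical choices made only to show how two superlinear, inversely related costs produce an interior optimum in the number of subdomains.

```python
def per_iteration_cost(n_unknowns, n_subdomains, a=1.5, b=1.5, coarse_weight=1.0):
    """Toy per-iteration cost of a two-scale preconditioner (assumed model).

    Assumes one subdomain per processor, so the subdomain solves run
    concurrently and only one of them counts toward the elapsed cost.
    The exponents a, b > 1 stand in for superlinear growth; they are
    illustrative values, not measurements.
    """
    # Subdomain size shrinks as the number of subdomains grows.
    subdomain_cost = (n_unknowns / n_subdomains) ** a
    # The crosspoint (coarse) system grows with the number of subdomains.
    crosspoint_cost = coarse_weight * n_subdomains ** b
    return subdomain_cost + crosspoint_cost


def best_granularity(n_unknowns, candidates, **model_params):
    """Return the candidate subdomain count with the lowest toy cost."""
    return min(
        candidates,
        key=lambda p: per_iteration_cost(n_unknowns, p, **model_params),
    )


if __name__ == "__main__":
    candidates = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
    for p in candidates:
        print(p, round(per_iteration_cost(10 ** 5, p), 1))
    print("best:", best_granularity(10 ** 5, candidates))
```

Under these assumed parameters the cost falls as subdomains are added, then rises once the crosspoint term dominates. Changing `coarse_weight` or the exponents shifts the optimum, which mirrors the problem- and machine-dependence noted above. The model ignores iteration count, interface solves, and load imbalance from local refinement.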