Abstract

The oil and gas industry makes use of computationally intensive algorithms to provide an image of the subsurface. The image is obtained by sending wave energy into the subsurface and recording the signal reflected back to the surface from interfaces in the Earth where the physical properties change. A seismic wave is usually generated by shots of known frequencies, placed close to the surface on land or close to the water surface at sea. Returning waves are usually recorded in time by hydrophones in a marine environment or by geophones during land acquisition. The goal of seismic imaging is to transform the seismograms into a spatial image of the subsurface. Migration algorithms produce an image of the subsurface given the seismic data measured at the surface.

In this thesis we focus on solving the Helmholtz equation, which describes wave propagation in the frequency domain; we can easily convert between the time domain and the frequency domain using the Fourier transform. A discretization with second-order finite differences gives a sparse linear system of equations that needs to be solved for each frequency. Both two- and three-dimensional problems are considered. Krylov subspace methods such as Bi-CGSTAB and IDR(s) have been chosen as solvers. Since the convergence of Krylov subspace solvers deteriorates with increasing wave number, a shifted Laplacian multigrid preconditioner is used to improve the convergence. Here, we extend the matrix-dependent multigrid method to complex-valued matrices in three dimensions. As smoothers, we consider parallelizable methods such as weighted Jacobi (ω-Jacobi), multi-colored Gauss-Seidel and damped multi-colored Gauss-Seidel (ω-GS).

The implementation of the preconditioned solver on a CPU (Central Processing Unit) is compared to an implementation on a GPU (Graphics Processing Unit, or graphics card) using CUDA (Compute Unified Device Architecture). The results show that in two dimensions the preconditioned Bi-CGSTAB method as well as the preconditioned IDR(s) method on a single GPU are about 30 times faster than on a single-threaded CPU. To achieve double-precision accuracy on the GPU we use iterative refinement, as described in Chapter 2.

However, problems of realistic size are too large to fit in the memory of one GPU. One solution is to use multiple GPUs. A currently widespread architecture consists of a multi-core computer connected to one or at most two GPUs, and those GPUs can have different characteristics and memory sizes. A setup with four or more identical GPUs is rather uncommon, but it would be ideal from a memory point of view, since the available memory would be four times that of a single GPU. However, GPUs are connected to a PCI bus, and in some cases two GPUs share the same PCI bus, which limits data transfer. In short, using multiple GPUs increases the total memory size, but data transfer becomes a bottleneck. Therefore, in Chapter 3 we consider different multi-GPU approaches and investigate how data transfer affects the performance of a CUDA implementation of the Krylov subspace solver with the shifted Laplacian multigrid preconditioner for the three-dimensional Helmholtz equation. Two multi-GPU approaches are considered: data parallelism and a split of the algorithm. Their implementations on a multi-GPU architecture are compared to a multi-threaded CPU implementation and a single-GPU implementation.
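For orientation, the frequency-domain problem and the preconditioner referred to above can be written out as follows; the complex shift shown is a common choice from the shifted Laplacian literature and is only indicative, not necessarily the value used in this thesis.

    % Helmholtz equation for the pressure field u at angular frequency omega,
    % with velocity c(x), wavenumber k = omega / c and source term f:
    -\Delta u(\mathbf{x}) - k^{2}(\mathbf{x})\, u(\mathbf{x}) = f(\mathbf{x}),
    \qquad k(\mathbf{x}) = \frac{\omega}{c(\mathbf{x})}.
    % Shifted Laplacian preconditioner: the same operator with a complex-shifted
    % wavenumber, approximately inverted by one multigrid cycle per Krylov iteration:
    M = -\Delta - (\beta_{1} - \mathrm{i}\,\beta_{2})\, k^{2}(\mathbf{x}),
    \qquad \text{for example } (\beta_{1}, \beta_{2}) = (1, 0.5).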
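The data-parallel approach mentioned above can be pictured as follows: each GPU owns a contiguous block of every vector and applies the same operation to its block. The sketch below is a minimal, hypothetical illustration using a simple vector update of the kind that appears inside a Krylov iteration; the kernel, sizes and the use of exactly two devices are assumptions for illustration and do not reproduce the thesis implementation.

    // Minimal sketch of data parallelism across two GPUs (assumes at least two devices).
    // Each GPU stores and updates one half of the vectors: y := y + a*x.
    // The host-device copies over the PCI bus are the communication cost discussed below.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void axpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev < 2) { printf("needs at least two GPUs\n"); return 1; }

        const int n = 1 << 20, half = n / 2;           // global length, one block per GPU
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        float *dx[2], *dy[2];

        for (int g = 0; g < 2; ++g) {                  // distribute the data
            cudaSetDevice(g);
            cudaMalloc((void**)&dx[g], half * sizeof(float));
            cudaMalloc((void**)&dy[g], half * sizeof(float));
            cudaMemcpy(dx[g], x.data() + g * half, half * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(dy[g], y.data() + g * half, half * sizeof(float), cudaMemcpyHostToDevice);
        }
        for (int g = 0; g < 2; ++g) {                  // same kernel on each block
            cudaSetDevice(g);
            axpy<<<(half + 255) / 256, 256>>>(half, 3.0f, dx[g], dy[g]);
        }
        for (int g = 0; g < 2; ++g) {                  // gather the result on the host
            cudaSetDevice(g);
            cudaMemcpy(y.data() + g * half, dy[g], half * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(dx[g]); cudaFree(dy[g]);
        }
        printf("y[0] = %.1f (expected 5.0)\n", y[0]);
        return 0;
    }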
The results show that the data-parallel implementation suffers from communication between GPUs and the CPU, but is still several times faster than the multi-core CPU implementation. The split of the algorithm across GPUs limits communication and delivers speedups comparable to a single-GPU implementation.

As a geophysical application that requires an efficient numerical method, we consider 3-D reverse time migration with the constant-density acoustic wave equation in Chapter 4. The idea of migration in the time domain is to first compute the forward wavefield by injecting the source wavelet. Secondly, we compute the wavefield backward in time by injecting the recorded signal at the receiver locations. Subsequently, we cross-correlate the forward and backward wavefields at given time steps (the standard form of this imaging condition is recalled below). An explicit finite-difference scheme in the time domain is a common choice, but it requires a significant amount of disk space to store the forward wavefields. The advantage of migration with a frequency-domain solver is that it does not require large amounts of disk space to store the snapshots; a disadvantage, however, is the memory usage of the solver. As GPUs generally have much less memory available than CPUs, this significantly limits the problem size. The frequency-domain approach simplifies the correlation of the source and receiver wavefields, but requires the solution of a large sparse linear system of equations. The question is whether migration in the frequency domain can compete with a time-domain implementation when both are performed on a parallel architecture. Both methods are naturally parallel over shots, but the frequency-domain method is also parallel over frequencies. Given a sufficiently large number of compute nodes, we can compute the result for each frequency in parallel, and the required time is dominated by the number of iterations for the highest frequency. Here, GPUs are used as accelerators and not as independent compute nodes. We optimize the throughput of the time-domain implementation with dynamic load balancing, asynchronous I/O and compression of snapshots. Since the frequency-domain solver employs a matrix-dependent prolongation, the coarse-grid operators require more storage than is available on GPUs for problems of realistic size.

An alternative to depth migration is least-squares migration (LSM). LSM was introduced as a bridge between full waveform inversion and migration. Like migration, LSM does not attempt to retrieve the background velocity model; however, like full waveform inversion, the modeled data should fit the observations. In Chapter 5 an efficient LSM algorithm is presented using several enhancements. Firstly, a frequency-decimation approach is introduced that makes use of the redundant information present in the data. It speeds up LSM while keeping the impact on accuracy minimal. Secondly, to store the sparse discretization and matrix-dependent prolongation matrices efficiently, a new matrix storage format, VCRS (Very Compressed Row Storage), is presented. The format supports lossless as well as lossy compression; it not only reduces the size of the stored matrix by a certain factor but also increases the efficiency of the matrix-vector computations. The study shows that the effect of lossless and lossy compression, with a proper choice of the compression parameters, is positive. Thirdly, we accelerate the LSM engine with GPUs.
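For reference, the cross-correlation of forward and backward wavefields described above for reverse time migration is commonly written as follows, together with its frequency-domain counterpart; this is the standard textbook form of the imaging condition, not necessarily the exact variant implemented in Chapter 4.

    % Time-domain imaging condition: source wavefield u_s, back-propagated
    % receiver wavefield u_r, recording time T, image point x:
    I(\mathbf{x}) = \sum_{\text{shots}} \int_{0}^{T} u_{s}(\mathbf{x}, t)\, u_{r}(\mathbf{x}, t)\, \mathrm{d}t .
    % Frequency-domain equivalent: a sum over frequencies of the source wavefield
    % times the complex conjugate of the receiver wavefield:
    I(\mathbf{x}) = \sum_{\text{shots}} \sum_{\omega} \operatorname{Re}\!\left\{ u_{s}(\mathbf{x}, \omega)\, \overline{u_{r}(\mathbf{x}, \omega)} \right\} .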
In the LSM implementation, a GPU is used either as an accelerator, where part of the data is transferred to the GPU to execute a set of operations, or as a replacement, where the complete data is stored in GPU memory. We demonstrate that using the GPU as a replacement leads to higher speedups and allows us to solve larger problems. Summarizing the effects of each improvement, the resulting speedup can be at least an order of magnitude compared to the original LSM method.
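The difference between the two modes of use can be sketched as follows; the kernel, sizes and iteration count are illustrative assumptions rather than the thesis code, but the pattern shows why keeping the data resident on the GPU ("replacement") avoids the repeated PCI-bus transfers of the "accelerator" mode.

    // Hypothetical sketch contrasting the two ways of using a GPU described above.
    // "Accelerator": data is copied to the GPU for every operation and copied back.
    // "Replacement": data stays resident in GPU memory for all iterations.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void scale(int n, float a, float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 22, iters = 100;
        std::vector<float> h(n, 1.0f);
        float* d;
        cudaMalloc((void**)&d, n * sizeof(float));

        // Accelerator mode: one PCI-bus round trip per operation dominates the run time.
        for (int it = 0; it < iters; ++it) {
            cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
            scale<<<(n + 255) / 256, 256>>>(n, 1.0001f, d);
            cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
        }

        // Replacement mode: copy once, iterate on the device, copy the result back once.
        cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        for (int it = 0; it < iters; ++it)
            scale<<<(n + 255) / 256, 256>>>(n, 1.0001f, d);
        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d);
        printf("h[0] = %f\n", h[0]);
        return 0;
    }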
