Abstract

We present new versions of the previously published C and CUDA programs for solving the dipolar Gross–Pitaevskii equation in one, two, and three spatial dimensions, which calculate stationary and non-stationary solutions by propagation in imaginary or real time. The presented programs are improved and parallelized versions of the previous programs, divided into three packages according to the type of parallelization. The first package contains improved and threaded versions of the sequential C programs, parallelized using OpenMP. The second package additionally parallelizes the three-dimensional variants of the OpenMP programs using MPI, allowing them to run on distributed-memory systems. Finally, the previous three-dimensional CUDA-parallelized programs are further parallelized using MPI, in the same manner as the OpenMP programs. We also present speedup test results obtained with the new versions of the programs in comparison with the previous sequential C and parallel CUDA programs. The improvements to the sequential version yield a speedup of 1.1–1.9, depending on the program. OpenMP parallelization yields a further speedup of 2–12 on a 16-core workstation, while the OpenMP/MPI version demonstrates a speedup of 11.5–16.5 on a computer cluster with 32 nodes used. The CUDA/MPI version shows a speedup of 9–10 on a computer cluster with 32 nodes.

New version program summary

Program Title: DBEC-GP-OMP-CUDA-MPI: (1) DBEC-GP-OMP package: (i) imag1dX-th, (ii) imag1dZ-th, (iii) imag2dXY-th, (iv) imag2dXZ-th, (v) imag3d-th, (vi) real1dX-th, (vii) real1dZ-th, (viii) real2dXY-th, (ix) real2dXZ-th, (x) real3d-th; (2) DBEC-GP-MPI package: (i) imag3d-mpi, (ii) real3d-mpi; (3) DBEC-GP-MPI-CUDA package: (i) imag3d-mpicuda, (ii) real3d-mpicuda.

Program Files doi: http://dx.doi.org/10.17632/j3z9z379m8.1

Licensing provisions: Apache License 2.0

Programming language: OpenMP C; CUDA C.

Computer: DBEC-GP-OMP runs on any multi-core personal computer or workstation with an OpenMP-capable C compiler and the FFTW3 library installed. The MPI versions are intended for a computer cluster with a recent MPI implementation installed. Additionally, DBEC-GP-MPI-CUDA requires a CUDA-aware MPI implementation, as well as an Nvidia GPU with Compute Capability 2.0 or higher and the CUDA toolkit (minimum version 7.5) installed on the computer or cluster.

Number of processors used: All available CPU cores on the executing computer for the OpenMP version, all available CPU cores across all cluster nodes used for the OpenMP/MPI version, and all available Nvidia GPUs across all cluster nodes used for the CUDA/MPI version.

Journal reference of previous version: Comput. Phys. Commun. 195 (2015) 117; ibid. 200 (2016) 406.

Does the new version supersede the previous version?: Not completely. The OpenMP version supersedes the previous AEWL_v1_0 version, while the MPI versions do not supersede the previous versions and are meant for execution on computer clusters and multi-GPU workstations.

Nature of problem: These programs are designed to solve the time-dependent nonlinear partial differential Gross–Pitaevskii (GP) equation with contact and dipolar interactions in an anisotropic harmonic trap. The GP equation describes the properties of a dilute trapped Bose–Einstein condensate. The OpenMP package contains programs for solving the GP equation in one, two, and three spatial dimensions, while the MPI packages contain only three-dimensional programs, which are computationally intensive or memory-demanding enough to require this level of parallelization.
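For reference, the dipolar GP equation solved by the programs has the schematic form given below; the precise dimensionless form actually implemented and the scaling of all coefficients are defined in Ref. [1]:

\[ i\hbar\,\frac{\partial \psi(\mathbf{r},t)}{\partial t} = \left[ -\frac{\hbar^{2}}{2m}\nabla^{2} + V(\mathbf{r}) + g\,|\psi(\mathbf{r},t)|^{2} + \int U_{\mathrm{dd}}(\mathbf{r}-\mathbf{r}')\,|\psi(\mathbf{r}',t)|^{2}\,\mathrm{d}\mathbf{r}' \right] \psi(\mathbf{r},t), \]

where \(V(\mathbf{r})\) is the anisotropic harmonic trap, \(g\) the contact-interaction strength, and \(U_{\mathrm{dd}}\) the dipole–dipole interaction potential. Imaginary-time propagation (obtained by the substitution \(t \to -i\tau\)) yields stationary states, while real-time propagation yields the dynamics.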
Solution method: The time-dependent GP equation is solved by the split-step Crank–Nicolson method, discretizing in space and time. The discretized equation is then solved by propagation, in either imaginary or real time, over small time steps. The contribution of the dipolar interaction is evaluated by a Fourier transformation to momentum space using the convolution theorem. MPI parallelization is done using domain decomposition. The method yields the solution of stationary and/or non-stationary problems.

Reasons for the new version: The previously published C and Fortran programs [1] for solving the dipolar GP equation are sequential in nature and do not exploit the multiple cores or CPUs found in typical modern computers. A parallel implementation using Nvidia CUDA exists [2], and both versions are already used within the ultra-cold atoms community [3]. However, the CUDA version requires special hardware, which limits its usability. Furthermore, many researchers have access to high-performance computer clusters, which could be used either to further speed up the computation or to work with problems that cannot fit into the memory of a single computer. In light of these observations, we have parallelized all programs using OpenMP, and then extended the parallelization of the three-dimensional programs to distributed-memory clusters using MPI. Since the CUDA implementation uses the same algorithm, and thus has the same structure and flow, we have applied the same data distribution scheme to provide a distributed-memory CUDA/MPI implementation of the three-dimensional programs.

Summary of revisions:

Package DBEC-GP-OMP: The previous serial C programs [1] are here improved and then parallelized using OpenMP. The main improvement consists of switching to the real-to-complex (R2C) Fourier transform, which is possible because the input of the transform is purely real. In this case the result of the transform has Hermitian symmetry, where one half of the values are complex conjugates of the other half. The fast Fourier transform (FFT) libraries we use exploit this to compute the result faster, using half the memory. To parallelize the programs, we have used OpenMP with the same approach as described in [4], and extended the parallelization to include the computation of the dipolar term. The FFT used in the computation of the dipolar term was also parallelized in a straightforward manner, using the built-in OpenMP support of the FFTW3 library [5]. With the introduction of multiple threads, memory usage has increased, driven by the need to keep some variables private to each thread. To reduce the memory consumed, we resorted to techniques similar to the ones used in our CUDA implementation [2], i.e., we have reduced the memory required for the FFT by exploiting the aforementioned R2C FFT, and reused memory through pointer aliases whenever possible.
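As an illustration of the two ingredients just described (FFTW3's built-in threading and the R2C transform), the following minimal sketch sets up threaded forward and backward transforms of a purely real array. The grid sizes, array names, and flags are illustrative and are not the identifiers used in the actual programs; the program is compiled with OpenMP enabled and linked against fftw3_omp and fftw3.

#include <complex.h>
#include <fftw3.h>
#include <omp.h>

#define NX 128          /* illustrative grid sizes, not program defaults */
#define NY 128
#define NZ 128

int main(void) {
   /* Enable FFTW3's built-in multi-threading before creating any plan. */
   fftw_init_threads();
   fftw_plan_with_nthreads(omp_get_max_threads());

   /* Thanks to Hermitian symmetry, the R2C output needs only
      NX * NY * (NZ/2 + 1) complex values instead of NX * NY * NZ. */
   double       *den  = fftw_alloc_real((size_t) NX * NY * NZ);
   fftw_complex *denk = fftw_alloc_complex((size_t) NX * NY * (NZ / 2 + 1));

   fftw_plan fwd = fftw_plan_dft_r2c_3d(NX, NY, NZ, den, denk, FFTW_MEASURE);
   fftw_plan bwd = fftw_plan_dft_c2r_3d(NX, NY, NZ, denk, den, FFTW_MEASURE);

   /* ... fill den[] with the condensate density, execute fwd, multiply
      denk[] by the dipolar interaction kernel in momentum space, execute
      bwd, and divide by NX*NY*NZ, since FFTW transforms are unnormalized ... */
   fftw_execute(fwd);
   fftw_execute(bwd);

   fftw_destroy_plan(fwd);
   fftw_destroy_plan(bwd);
   fftw_free(den);
   fftw_free(denk);
   fftw_cleanup_threads();
   return 0;
}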
Package DBEC-GP-MPI: The next step in the parallelization was to extend the programs to run on distributed-memory systems, i.e., on computer clusters, using domain decomposition with the MPI programming paradigm. We chose the newly implemented threaded versions of the programs as the starting point. Alternatively, we could have used the serial versions and attempted a pure MPI parallelization; however, we have found that the OpenMP-parallelized routines better exploit data locality and thus outperform a pure MPI implementation. Therefore, our OpenMP/MPI-parallelized programs are intended to run one MPI process per cluster node, with each process spawning OpenMP threads as needed on its node. Note that this is not a requirement, and users may run more than one MPI process per node, but we advise against it for performance reasons. With the suggested execution strategy (one MPI process per cluster node, each spawning as many threads as there are CPU cores), the OpenMP threads perform most of the computation, and MPI is used for data exchanges between processes.

There are numerous ways to distribute the data between MPI processes, and we decided to use a simple one-dimensional distribution, also known as slab decomposition. Data are distributed along the first (slowest-changing) dimension, which corresponds to the NX spatial dimension in our programs (see Fig. 1). Each process is assigned a different portion of the NX dimension and holds the entire NY and NZ spatial dimensions locally. This allows each process to perform the computation along those two dimensions in the same way as before, without any data exchanges. When the computation requires the whole NX dimension to be local to each process, we transpose the data, and after the computation we transpose it back.

The transpose routine can be implemented in many ways with MPI, most commonly using the MPI_Alltoall function or using transpose routines from external libraries like FFTW3 [5] or 2DECOMP&FFT [6]. Since we already rely on the FFTW3 library for the FFT, we have utilized its dedicated transpose interface to perform the necessary transformations. To speed up the transpose operation, we do not perform a full transposition of the data, but rather leave it locally transposed. That is, we transform from local_NX × NY × NZ, stored in row-major order, to NX × local_NY × NZ in row-major order (where local_NX = NX / number_of_processes, and equivalently for local_NY). An additional benefit of this approach is that we do not have to make significant changes in the way array elements are processed; in most cases we only have to adjust the loop limit of the non-local dimension.

Package DBEC-GP-MPI-CUDA: The aforementioned data distribution scheme can also be applied to the CUDA version of the programs [2]. However, there is no CUDA support in FFTW3, and cuFFT (used in the CUDA programs for the FFT) does not provide an equivalent MPI or transpose interface. Instead, we developed our own transpose routines and used them in the FFT computation. One example of a manual implementation of transpose routines is shown in Ref. [7], and while we could readily use the same code, we wanted to obtain the same result as when using FFTW3. To achieve this, we use the same basic principle as in Ref. [7]: first we create a custom MPI datatype that maps to the portions of the data to be exchanged, followed by an all-to-all communication to exchange the data between processes, see Fig. 2 for details.
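The following simplified sketch illustrates a transpose of this kind for host-resident data, going from local_NX × NY × NZ to NX × local_NY × NZ in row-major order; the function and variable names are illustrative, and the actual programs additionally handle GPU buffers and the corresponding inverse transpose.

#include <complex.h>
#include <mpi.h>

/* Illustrative transpose over nproc ranks: each rank holds a row-major
   local_NX x NY x NZ block of complex data and, after the exchange,
   holds an NX x local_NY x NZ block, where local_NX = NX / nproc and
   local_NY = NY / nproc. */
void transpose_slabs(const double complex *send, double complex *recv,
                     int NX, int NY, int NZ, int nproc, MPI_Comm comm) {
    int local_NX = NX / nproc, local_NY = NY / nproc;

    /* Data destined for one rank: local_NX blocks of local_NY*NZ contiguous
       elements, separated by a stride of one full y-z plane (NY*NZ). */
    MPI_Datatype vec, sendtype;
    MPI_Type_vector(local_NX, local_NY * NZ, NY * NZ,
                    MPI_C_DOUBLE_COMPLEX, &vec);

    /* Resize so that blocks for consecutive destination ranks start
       local_NY*NZ elements apart in the send buffer. */
    MPI_Type_create_resized(vec, 0,
        (MPI_Aint) (local_NY * NZ) * sizeof(double complex), &sendtype);
    MPI_Type_commit(&sendtype);

    /* Each rank sends one such block to every rank and receives a
       contiguous local_NX*local_NY*NZ block from every rank. */
    MPI_Alltoall(send, 1, sendtype,
                 recv, local_NX * local_NY * NZ, MPI_C_DOUBLE_COMPLEX, comm);

    MPI_Type_free(&vec);
    MPI_Type_free(&sendtype);
}

With a CUDA-aware MPI implementation, the same call pattern can be applied to GPU memory pointers, which is why DBEC-GP-MPI-CUDA requires such an implementation (see the additional comments below).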
The implemented transpose routines are also used to compute the distributed-memory FFT, performed over all MPI processes. To divide the computation of a multidimensional FFT, in our case three-dimensional, we use the well-known row–column algorithm. The basic idea of the algorithm is perhaps best explained on a two-dimensional FFT of N × M data, stored in row-major order, as illustrated in Fig. 3. First, N one-dimensional FFTs of length M are performed (along the rows of the data), followed by a transpose, after which the data are stored as M × N in row-major order. Now M FFTs of length N can be performed along what used to be the columns of the original data, but are stored as rows after transposing. Finally, an optional transpose can be performed to return the data to their original N × M form. In three dimensions, we can perform a two-dimensional FFT, transpose the data, and perform the FFT along the third dimension. This algorithm is easily adapted to distributed-memory systems. We use the advanced cuFFT interface for the local computation of the FFT, and use our transpose routine to redistribute the data.

Note that the DBEC-GP-MPI-CUDA programs can be easily modified to work on a single workstation with multiple GPU cards, or on a computer cluster with multiple GPU cards per node. In that case, a separate MPI process should be launched for each GPU card, and the programs should be modified to assign a separate GPU card to each process on the same cluster node.

MPI output format: Given that the distributed-memory versions of the programs can be used for much larger grid sizes, the output they produce (i.e., the density profiles) can be much larger and more difficult to handle. To alleviate this problem, we have switched from textual to binary output. This allowed us to reduce the size of the files while still retaining full precision. All MPI processes write their output to the same file, at the corresponding offsets, relieving the user of the task of combining the files. The binary output can subsequently be converted to text, for example by using the hexdump command on UNIX-like systems. We have also developed a simple script which converts the output from binary to textual format and included it in the software package.
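A minimal sketch of how such a shared-file binary write can be realized with MPI-IO is given below; the function name and the assumption that each rank's slab is a contiguous local_NX × NY × NZ block of doubles are illustrative, and the record layout used by the actual programs may differ.

#include <mpi.h>

/* Illustrative collective write: rank r writes its contiguous
   local_NX x NY x NZ slab of doubles at offset r * slab_size
   into a single shared binary file. */
void write_density_binary(const double *den, int local_NX, int NY, int NZ,
                          const char *fname, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    int count = local_NX * NY * NZ;
    MPI_Offset offset = (MPI_Offset) rank * count * sizeof(double);

    /* All ranks write their slabs in parallel into one file, so no
       post-processing is needed to merge per-process outputs. */
    MPI_File_write_at_all(fh, offset, den, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}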
Testing results: We have tested all programs on the PARADOX supercomputing facility at the Scientific Computing Laboratory of the Institute of Physics Belgrade. The nodes used for testing had two Intel Xeon E5-2670 CPUs (a total of 2 × 8 = 16 CPU cores) with 32 GB of RAM and one Nvidia Tesla M2090 GPU with 6 GB of RAM, and were connected by an Infiniband QDR interconnect. The presented results are obtained for arbitrary grid sizes, which are not tailored to maximize the performance of the programs. We also stress that the execution times and speedups reported here are calculated for the critical parallelized parts of the programs performing the iterations over imaginary or real time steps, and exclude the time spent on initialization (thread initialization, MPI environment, allocation/deallocation of memory, creation/destruction of FFTW plans, I/O operations). As part of its output, each program separately prints the initialization time and the time spent on iterations of the GP propagation. The latter is used to calculate the speedup, since a speedup obtained this way does not depend on the number of iterations and is more relevant for large numbers of iterations.

The testing of the OpenMP programs DBEC-GP-OMP was performed with the number of threads varying from 1 to 16. Table 1 and Fig. 4 show the obtained absolute wall-clock times, speedups, and scaling efficiencies, as well as a comparison with the previous serial versions of the programs [1]. As we can see from the table, the improvements in the FFT routine alone already yield a speedup of 1.3 to 1.9 for single-threaded (T = 1) 2d and 3d programs compared to the previous serial programs, and a somewhat smaller speedup of 1.1 to 1.3 for the 1d programs. The use of additional threads brings a further speedup of 2 to 2.5 for the 1d programs, and 9 to 12 for the 2d and 3d programs. From Fig. 4 we see that for the 1d programs, although the speedup increases with the number of threads used, the efficiency decreases due to the insufficient size of the problem; one can achieve almost the maximal speedup already with T = 4 threads, while still keeping the efficiency around 50%. We also see, as expected, that the speedup and efficiency of the 2d and 3d programs behave quite well as we increase the number of threads. In particular, we note that the efficiency always stays above 60%, making the use of all available CPU cores worthwhile.

For testing of the MPI versions we have used a similar methodology to measure the strong-scaling performance. For the OpenMP/MPI programs DBEC-GP-MPI, the obtained wall-clock times are shown in Table 2, together with the corresponding wall-clock times of the OpenMP programs DBEC-GP-OMP, which served as a baseline for calculating speedups. The testing was done for a varying number of cluster nodes, from 4 to 32, and the measured speedup ranged from 11 to 16.5. The corresponding graphs of speedups and efficiencies are shown in Fig. 5, where we can see that the speedup grows linearly with the number of nodes used, while the efficiency remains mostly constant, in the range between 40% and 60%, thus making the use of the OpenMP/MPI programs highly advantageous for problems with large grid sizes.

For the CUDA/MPI programs DBEC-GP-MPI-CUDA we observe similar behavior in Table 3 and Fig. 6. The speedup obtained with N = 32 nodes ranges from 9 to 10, with an efficiency between 30% and 40%. While the efficiency is slightly lower than in the case of the OpenMP/MPI programs, which could be expected due to the more complex memory hierarchy of a multi-GPU system distributed over many cluster nodes, the speedup still grows linearly, making the CUDA/MPI programs an ideal choice for use on GPU-enabled computer clusters. An additional benefit of these programs is their low CPU usage (up to one CPU core), allowing the same cluster nodes to be used for other CPU-intensive simulations.

The introduction of distributed transposes of the data creates some overhead, which negatively impacts the scaling efficiency. This is more evident in the CUDA/MPI version, as its transpose algorithm is inferior to the one provided by FFTW3. In our tests, both MPI versions of the programs failed to achieve a speedup on fewer than 4 nodes, due to the introduction of the transpose routines. We therefore recommend using the MPI versions only on 4 or more cluster nodes. The performance of the MPI versions depends strongly not only on the configuration of the cluster, mainly on the speed of the interconnect, but also on the distribution of processes and threads, the NUMA configuration, etc. We recommend that users experiment with several different configurations to achieve the best performance. The results presented here were obtained without extensive tuning, with the aim of showing the base performance.

Finally, we note that the best performance is achieved by evenly distributing the workload among the MPI processes and OpenMP threads, and by using grid sizes which are optimal for the FFT. In particular, the programs in the DBEC-GP-OMP package perform best if NX, NY, and NZ are divisible by the number of OpenMP threads used. Similarly, the DBEC-GP-MPI programs perform best if NX and NY are divisible by the product of the number of MPI processes and the number of OpenMP threads used. The DBEC-GP-MPI-CUDA programs perform best if NX and NY are divisible by the product of the number of MPI processes and the number of Streaming Multiprocessors (SM) of the GPU used. For all three packages, the best FFT performance is obtained if NX, NY, and NZ can be expressed as 2^a 3^b 5^c 7^d 11^e 13^f, where e + f is either 0 or 1 and the other exponents are non-negative integers [8].
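As a convenience, a small helper like the following (not part of the packages; the name and the e + f ≤ 1 criterion follow the FFTW guidance cited above) can be used to check whether a chosen grid dimension is FFT-friendly in this sense.

/* Returns 1 if n = 2^a 3^b 5^c 7^d 11^e 13^f with e + f <= 1,
   i.e., a transform size that FFTW handles particularly efficiently. */
static int fft_friendly(long n) {
    if (n < 1) return 0;
    for (long f = 2; f <= 7; f += (f == 2 ? 1 : 2))  /* factors 2, 3, 5, 7 */
        while (n % f == 0) n /= f;
    if (n % 11 == 0) n /= 11;                        /* at most one factor  */
    else if (n % 13 == 0) n /= 13;                   /* of 11 or 13 in total */
    return n == 1;
}

For example, 480 = 2^5 · 3 · 5 passes the check, while 286 = 2 · 11 · 13 does not.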
Additional comments, restrictions, and unusual features: The MPI programs require that the grid size (controlled by the input parameters NX, NY, and NZ) can be evenly distributed between the processes, i.e., that NX and NY are divisible by the number of MPI processes. Since the data are never distributed along the NZ dimension, there is no such requirement on NZ. The programs test whether these conditions are met and report an error if they are not. Additionally, the MPI versions of the CUDA programs require a CUDA-aware MPI implementation, which allows the MPI runtime to access GPU memory pointers directly and avoids copying the data to main RAM. A list of CUDA-aware MPI implementations can be found in Ref. [9].

Acknowledgments: V.L., S.Š. and A.B. acknowledge support by the Ministry of Education, Science, and Technological Development of the Republic of Serbia under projects ON171017, OI1611005, and III43007, as well as SCOPES project IZ74Z0-160453. L.E. Y.-S. acknowledges support by the FAPESP of Brazil under projects 2012/21871-7 and 2014/16363-8. P.M. acknowledges support by the Science and Engineering Research Board, Department of Science and Technology, Government of India under project No. EMR/2014/000644. S.K.A. acknowledges support by the CNPq of Brazil under project 303280/2014-0, and by the FAPESP of Brazil under project 2012/00451-0.
