NVIDIA's Architecture Research Articles

This paper reviews approaches to parallelizing grid-based numerical methods for the shallow water equations on multiprocessor systems and graphics processors. It describes a multithreaded approach for shared-memory systems based on the OpenMP programming interface, and a geometric decomposition approach with message passing, using the MPI library, for distributed-memory computers. Multithreaded GPU programming based on the OpenACC interface is also considered. The work parallelizes COASTOX-HD, the hydrodynamic model of the COASTOX-UN system for two-dimensional modeling of hydrodynamics, sediment transport, and radionuclide transport in river systems and coastal sea areas. The model solves the shallow water equations with finite-volume methods on unstructured computational grids of variable-size triangular cells. The parallelization uses a hybrid MPI+OpenACC approach: geometric decomposition with MPI-based messaging on multiprocessor computers, and multithreading through OpenACC directives on GPUs. Performance was evaluated on typical problems of shallow-water hydrodynamics, river flooding, and tsunami wave run-up on the coast, using a Dell Precision Workstation 7920 with two 20-core Intel Xeon Gold 6230 processors and NVIDIA Quadro RTX 5000 and NVIDIA GeForce RTX 3080 graphics cards. The parallel model significantly accelerated the simulations on both the multiprocessor system and the GPUs. GPU speedup depends on the size of the computational grid, rising toward saturation as the number of grid cells increases. Because the model's numerical schemes have low computational intensity, the memory bandwidth of the NVIDIA GPUs limits the speedup more than their compute performance does.
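The closing claim about memory bandwidth can be illustrated with a roofline-style estimate. The sketch below is not taken from the paper: the device figures and the arithmetic intensity are hypothetical placeholders, chosen only to show that a low-intensity kernel scales with bandwidth rather than with peak compute throughput.

```python
def attainable_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """Roofline bound: the lesser of the compute peak and intensity * bandwidth."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gb_s)


def is_bandwidth_bound(intensity_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """True when the kernel sits left of the roofline ridge point."""
    return intensity_flop_per_byte < peak_gflops / bandwidth_gb_s


# Hypothetical devices (peak GFLOP/s, memory bandwidth GB/s); not measured values.
DEVICES = {
    "gpu_a": (10000.0, 400.0),
    "gpu_b": (30000.0, 800.0),
}

# Assumed arithmetic intensity of a low-intensity stencil-like update (FLOP per byte).
INTENSITY = 0.5

if __name__ == "__main__":
    for name, (peak, bandwidth) in DEVICES.items():
        bound = attainable_gflops(INTENSITY, peak, bandwidth)
        limited = is_bandwidth_bound(INTENSITY, peak, bandwidth)
        print(f"{name}: attainable {bound:.0f} GFLOP/s, bandwidth-bound={limited}")
```

With these placeholder numbers, the second device has three times the peak compute of the first but only twice the attainable throughput, matching its bandwidth ratio.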

Multiprecision modular exponentiation has a variety of uses, including cryptography, primality testing, and computational number theory. It is also very costly to compute. GPU parallelism can accelerate these computations, but using the GPU efficiently requires a problem that involves many simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is an important problem that fits this profile. We are developing a framework that generates highly efficient implementations of exponentiation operations for different NVIDIA GPU architectures and problem instances. One challenge in generating such code is that NVIDIA's PTX is not a true assembly language but a virtual instruction set that is compiled and optimized differently for each generation of GPU hardware. The same PTX code therefore runs with different levels of efficiency on different machines. In addition, as the precision of the computations changes, each architecture has its own break-even points at which a different algorithm or parallelization strategy must be employed. Making the code efficient for a given problem instance and architecture thus requires searching a multidimensional space of algorithms and configurations: generating PTX code for each combination, executing it, validating the numerical result, and evaluating its performance. Our framework automates much of this process and produces exponentiation code that is up to six times faster than the best known hand-coded implementations for the NVIDIA GTX 580. Our goal is to let users find the best configuration for each new GPU architecture relatively quickly. However, in migrating to the GTX 680, which has three times as many cores as the GTX 580, we found that the best performance our system could achieve was significantly lower than on the GTX 580. The decrease was traced to a radical shift in the NVIDIA architecture that greatly reduces the storage resources available to each core. Further analysis and feasibility simulations indicate that, by changing our code generators to adapt to different storage models, it should be possible to take greater advantage of the parallelism on the GTX 680. That will add a new dimension to our search space, but it will also give our framework greater flexibility for dealing with future architectures.
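The generate, execute, validate, and time loop described above can be sketched on the CPU in a few lines. This is not the authors' framework and generates no PTX: it assumes a single tunable parameter (the window width of a fixed-window exponentiation), and the names `modexp_window` and `search_best_window` are hypothetical. Python's built-in `pow` serves as the reference for validation.

```python
import random
import time


def modexp_window(base, exponent, modulus, w):
    """Left-to-right fixed-window (2^w-ary) modular exponentiation."""
    if modulus == 1:
        return 0
    # Precompute base^0 .. base^(2^w - 1) modulo the modulus.
    table = [1]
    for _ in range((1 << w) - 1):
        table.append(table[-1] * base % modulus)
    # Split the exponent into w-bit digits, least significant first.
    digits = []
    e = exponent
    while e > 0:
        digits.append(e & ((1 << w) - 1))
        e >>= w
    result = 1
    for d in reversed(digits):
        for _ in range(w):  # w squarings shift the accumulated exponent by w bits
            result = result * result % modulus
        result = result * table[d] % modulus
    return result


def search_best_window(bits, batch, windows=(1, 2, 3, 4, 5, 6), seed=0):
    """Time each candidate window width on a batch and keep only validated ones."""
    rng = random.Random(seed)
    modulus = rng.getrandbits(bits) | 1 | (1 << (bits - 1))  # odd, full width
    problems = [(rng.getrandbits(bits), rng.getrandbits(bits)) for _ in range(batch)]
    expected = [pow(b, e, modulus) for b, e in problems]  # reference results
    timings = {}
    for w in windows:
        start = time.perf_counter()
        got = [modexp_window(b, e, modulus, w) for b, e in problems]
        elapsed = time.perf_counter() - start
        if got == expected:  # discard any configuration that fails validation
            timings[w] = elapsed
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    best, timings = search_best_window(bits=1024, batch=16)
    for w, t in sorted(timings.items()):
        print(f"w={w}: {t * 1e3:.2f} ms")
    print("fastest window width on this machine:", best)
```

The fastest width depends on the operand size and on the machine, which is the same break-even behavior the abstract describes across GPU generations.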

Articles published on NVIDIA's Architecture

Quantitative Performance Analysis of BLAS Libraries on GPU Architectures (BLAS Kütüphanelerinin GPU Mimarilerindeki Nicel Performans Analizi)

Parallelization of numerical solutions of shallow water equations by the finite volume method for implementation on multiprocessor systems and graphics processors

GPU acceleration of local and semilocal density functional calculations in the SPARC electronic structure code.

Fine‐Grained Memory Profiling of GPGPU Kernels

Billion degree of freedom granular dynamics simulation on commodity hardware via heterogeneous data-type representation

High-throughput fuzzy clustering on heterogeneous architectures

Comparison of analytical and ML-based models for predicting CPU–GPU data transfer time

Exploiting Bank Conflict-based Side-channel Timing Leakage of GPUs

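A Study of Implementation Models of the Wave Algorithm for Robot Motion on the NVIDIA Architecture Using CUDA Technology (ИССЛЕДОВАНИЕ МОДЕЛЕЙ РЕАЛИЗАЦИИ ВОЛНОВОГО АЛГОРИТМА ДВИЖЕНИЯ РОБОТА ДЛЯ АРХИТЕКТУРЫ NVIDIA В ТЕХНОЛОГИИ CUDA)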

DawnCC

Efficient implementation of morphological index for building/shadow extraction from remotely sensed images

The Advanced Synthetic Aperture Sonar Imaging eNgine (ASASIN), a time-domain backprojection beamformer using graphics processing units

Implementation of Sorting Algorithms with CUDA: An Empirical Study

Exposing errors related to weak memory in GPU applications

Evaluation of the 3-D finite difference implementation of the acoustic diffusion equation model on massively parallel architectures

Comprehensive Evaluation of a New GPU-based Approach to the Shortest Path Problem

Execution time optimisation using delayed multidimensional retiming

SEARCH-BASED AUTOMATIC CODE GENERATION FOR MULTIPRECISION MODULAR EXPONENTIATION ON MULTIPLE GENERATIONS OF GPU
