Single-precision Operations Research Articles

Purpose: The purpose of this paper is to improve the computational speed of solving nonlinear dynamic problems by using parallel methods and a mixed-precision algorithm on graphics processing units (GPUs). Traditional central processing unit (CPU)-based computer-aided engineering software has struggled to deliver the computational efficiency that scientific research and practical engineering require, especially for nonlinear dynamic problems. In addition, when calculations are performed on GPUs, double-precision operations are slower than single-precision operations. This paper therefore implements mixed precision for nonlinear dynamic problem simulation using the Belytschko-Tsay (BT) shell element on a GPU.

Design/methodology/approach: To minimize data transfer between heterogeneous architectures, the parallel computation of the fully explicit finite element (FE) calculation is realized using a vectorized thread-level parallelism algorithm. An asynchronous data transmission strategy and a novel dependency relationship link-based method, for efficiently solving parallel explicit shell element equations, are used to improve the GPU utilization ratio. Finally, this paper implements mixed precision for nonlinear dynamic problem simulation using the BT shell element on a GPU and compares it with a CPU-based serially executed program and a GPU-based double-precision parallel computing program.

Findings: For a car body model containing approximately 5.3 million degrees of freedom, the computational speed is improved 25 times over CPU sequential computation, and approximately 10% over the double-precision parallel computing method. The accuracy error of the mixed-precision computation is small and can satisfy the requirements of practical engineering problems.

Originality/value: This paper realizes a novel FE parallel computing procedure for nonlinear dynamic problems using a mixed-precision algorithm on a CPU-GPU platform. Compared with the CPU serial program, the program implemented in this article achieves a 25-fold speedup on a model of 883,168 elements, which greatly improves the calculation speed for solving nonlinear dynamic problems.
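The abstract does not say how the authors split work between precisions, so the following is only a minimal sketch of the general idea behind mixed precision, not the paper's algorithm. It assumes one common arrangement: inputs are stored in single precision while the running sum is kept in double precision. The helper names `to_f32` and `accumulate` are hypothetical, and single precision is emulated with the standard-library `struct` module.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, single_accumulator: bool) -> float:
    """Sum values, optionally rounding the running total to single precision."""
    total = 0.0
    for v in values:
        total += v
        if single_accumulator:
            # Emulate a float32 accumulator: round after every addition.
            total = to_f32(total)
    return total


N = 1_000_000
step = to_f32(0.1)          # input data stored in single precision
values = [step] * N
exact = N / 10              # reference value of the sum

# Pure single precision: data and accumulator are both float32.
single_err = abs(accumulate(values, single_accumulator=True) - exact)
# Mixed precision (assumed arrangement): float32 data, float64 accumulator.
mixed_err = abs(accumulate(values, single_accumulator=False) - exact)

print(f"single-precision accumulator error: {single_err:.6f}")
print(f"mixed-precision accumulator error:  {mixed_err:.6f}")
```

In this toy sum the double-precision accumulator removes almost all of the rounding error that builds up in the pure single-precision run, while the data itself stays in the cheaper format.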

Since its creation, the ImageNet-1k benchmark set has played a significant role as a benchmark for ascertaining the accuracy of different deep neural net (DNN) models on the image classification problem. Moreover, in recent years it has also served as the principal benchmark for assessing different approaches to DNN training. Finishing a 90-epoch ImageNet-1k training with ResNet-50 on an NVIDIA M40 GPU takes 14 days. This training requires $10^{18}$ single-precision operations in total. On the other hand, the world's current fastest supercomputer can finish $3 \times 10^{17}$ single-precision operations per second (according to the Nov 2018 Top 500 results). If we can make full use of the computing capability of the fastest supercomputer, we should be able to finish the training in several seconds. Over the last two years, researchers have focused on closing this significant performance gap by scaling DNN training to larger numbers of processors. Most successful approaches to scaling ImageNet training have used synchronous mini-batch stochastic gradient descent (SGD). However, to scale synchronous SGD one must also increase the batch size used in each iteration. Thus, for many researchers, the focus on scaling DNN training has translated into a focus on developing training algorithms that enable increasing the batch size in data-parallel synchronous SGD without losing accuracy over a fixed number of epochs. In this paper, we investigate the capability of supercomputers to speed up DNN training. Our approach is to use a large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources. Our approach is generic: we empirically evaluate its effectiveness on five neural networks (AlexNet, AlexNet-BN, GNMT, ResNet-50, and ResNet-50-v2) trained with large datasets while preserving the state-of-the-art test accuracy. Compared to the baseline of a previous study from Goyal et al. [1], our approach shows higher test accuracy on batch sizes that are larger than 16K. When we use the same baseline, our results are better than Goyal et al. for all the batch sizes (Fig. 20). Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. With 2,048 Intel Xeon Phi 7250 processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe, Facebook's PyTorch, and Google's TensorFlow. The differences between this paper and the conference version of our work [2] are: (1) we implement our approach on Google's cloud Tensor Processing Unit (TPU) platform, which verifies our previous success on CPUs and GPUs; (2) we scale the batch size of ResNet-50-v2 to 32K and achieve 76.3 percent accuracy, which is better than the 75.3 percent accuracy achieved in our conference paper; (3) we apply our approach to Google's Neural Machine Translation (GNMT) application, which helps us achieve a 4× speedup on the cloud TPUs.
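The "several seconds" estimate in this abstract follows directly from the two figures it quotes. The short calculation below is a sketch that assumes perfect utilization of the supercomputer's peak rate. The variable names are illustrative, and the implied M40 rate is derived only from the abstract's numbers rather than taken from a measurement.

```python
# Figures quoted in the abstract.
TOTAL_OPS = 1e18            # single-precision operations for 90-epoch ResNet-50
PEAK_OPS_PER_SEC = 3e17     # fastest supercomputer, Nov 2018 Top 500
M40_TRAINING_DAYS = 14      # training time on one NVIDIA M40 GPU

SECONDS_PER_DAY = 24 * 60 * 60

# Ideal training time at full peak utilization (an assumption, never reached in practice).
ideal_seconds = TOTAL_OPS / PEAK_OPS_PER_SEC

# Time on a single M40, and the sustained rate this implies.
m40_seconds = M40_TRAINING_DAYS * SECONDS_PER_DAY
implied_m40_rate = TOTAL_OPS / m40_seconds

# Size of the performance gap the abstract refers to.
gap = m40_seconds / ideal_seconds

print(f"ideal time on the supercomputer: {ideal_seconds:.2f} s")
print(f"time on one M40:                 {m40_seconds} s")
print(f"implied M40 sustained rate:      {implied_m40_rate:.3e} ops/s")
print(f"gap (M40 time / ideal time):     {gap:.0f}x")
```

Under these assumptions the ideal time is about 3.3 seconds, against roughly 1.2 million seconds on a single M40.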

Articles published on Single-precision Operations

Seismic modeling and inversion using half-precision floating-point numbers

A novel parallel finite element procedure for nonlinear dynamic problems using GPU and mixed-precision algorithm

Fast Deep Neural Network Training on Distributed Systems and Cloud TPUs

Efficient Multiple-Precision Floating-Point Fused Multiply-Add with Mixed-Precision Support

Compressible lattice Boltzmann simulations on high‐performance and low‐cost GeForce GPU

New Features of Parallel Implementation of N-Body Problems on GPU

High performance and energy efficient single‐precision and double‐precision merged floating‐point adder on FPGA

An area efficient multi-mode quadruple precision floating point adder

Guest Editorial: Computing Frontiers

Configurable Multimode Embedded Floating-Point Units for FPGAs

High-speed and low-power reconfigurable architectures of 2-digit two-dimensional logarithmic number system-based recursive multipliers

The PMS project: Poor man's supercomputer

Multiple precision square root using the Dwandwa square-root algorithm

Parallel efficiency for solving linear systems on dawn1000

Modified straight division: A computer implementation of multiple-precision division

Fast rounding in multiprecision floating-slash arithmetic

Fast Multiple-Precision Evaluation of Elementary Functions
