Use Of SIMD Instructions Research Articles

We have developed a numerical software library for collisionless N-body simulations named “Phantom-GRAPE” which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). In our library, not only the Newton’s forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius rcut (i.e. f(r)=0 at r>rcut), can be quickly computed. In computing such central forces with an arbitrary force shape f(r), we refer to a pre-calculated look-up table. We also present a new scheme to create the look-up table whose binning is optimal to keep good accuracy in computing forces and whose size is small enough to avoid cache misses. Using an Intel Core i7–2600 processor, we measure the performance of our library for both of the Newton’s forces and the arbitrarily shaped central forces. In the case of Newton’s forces, we achieve 2×109 interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times than that with the SSE instructions. With four processor cores, we obtain the performance of 8×109 interactions per second (or 300 GFLOPS). In the case of the arbitrarily shaped central forces, we can calculate 1×109 and 4×109 interactions per second with one and four processor cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions. These performances depend only weakly on the number of particles, irrespective of the force shape. It is good contrast with the fact that the performance of force calculations accelerated by graphics processing units (GPUs) depends strongly on the number of particles. Substantially weak dependence of the performance on the number of particles is suitable to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. We conclude that collisionless N-body simulations accelerated with our library have significant advantage over those accelerated by GPUs, especially on massively parallel environments.

Read full abstract

The video compression algorithms based on the 3D wavelet transform obtain excellent compression rates at the expense of huge memory requirements, that drastically affects the execution time of such applications. Its objective is to allow the real-time video compression based on the 3D fast wavelet transform. We show the hardware and software interaction for this multimedia application on a general-purpose processor. First, we mitigate the memory problem by exploiting the memory hierarchy of the processor using several techniques. As for instance, we implement and evaluate the blocking technique. We present two blocking approaches in particular: cube and rectangular, both of which differ in the way the original working set is divided. We also put forward the reuse of previous computations in order to decrease the number of memory accesses and floating point operations. Afterwards, we present several optimizations that cannot be applied by the compiler due to the characteristics of the algorithm. On the one hand, the Streaming SIMD Extensions (SSE) are used for some of the dimensions of the sequence (y and time), to reduce the number of floating point instructions, exploiting Data Level Parallelism. Then, we apply loop unrolling and data prefetching to specific parts of the code. On the other hand, the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show speedups of 5x in the execution time over a version compiled with the maximum optimizations of the Intel C/C++ compiler, maintaining the compression ratio and the video quality (PSNR) of the original encoder based on the 3D wavelet transform. Our experiments also show that, allowing the compiler to perform some of these optimizations (i.e. automatic code vectorization), causes performance slowdown, demonstrating the effectiveness of our optimizations.

Read full abstract

Use Of SIMD Instructions Research Articles

Related Topics

Articles published on Use Of SIMD Instructions

Increasing the speed of the permutation multiplication operation due to use of simd instructions

A seven-dimensional analysis of hashing methods and its implications on query processing

Low-Power High-Efficiency Video Decoding using General-Purpose Processors

Phantom-GRAPE: Numerical software library to accelerate collisionless N-body simulation with SIMD instruction set on x86 architecture

N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions

The use of coherent ray tracing for physically accurate rendering

Shallow Bounding Volume Hierarchies for Fast SIMD Ray Tracing of Incoherent Rays

Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Use Of SIMD Instructions Research Articles

Related Topics

Articles published on Use Of SIMD Instructions

Increasing the speed of the permutation multiplication operation due to use of simd instructions

A seven-dimensional analysis of hashing methods and its implications on query processing

Low-Power High-Efficiency Video Decoding using General-Purpose Processors

Phantom-GRAPE: Numerical software library to accelerate collisionless N-body simulation with SIMD instruction set on x86 architecture

N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions

The use of coherent ray tracing for physically accurate rendering

Shallow Bounding Volume Hierarchies for Fast SIMD Ray Tracing of Incoherent Rays

Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions