SIMD Units Research Articles

Compiling sequential C programs for Connex-S is challenging. Connex-S is a competitive, scalable, and customizable wide vector accelerator for intensive embedded applications, with 32 to 4,096 16-bit integer lanes and a limited-capacity local scratchpad memory. Our compiler toolchain uses the LLVM framework and targets OPINCAA, a JIT vector assembler and coordination C++ library that lets Connex-S accelerate computations for an arbitrary CPU. We therefore address efficient vectorization, communication, and synchronization in the compiler middle end. We perform quantitative static analysis of the program, which is useful for, among other things, the symbolic-size compiler memory allocator and the coordination mechanism of OPINCAA. We also discuss the LLVM back end for the Connex-S processor and the methodology for automatically generating instruction selection code that efficiently emulates arithmetic and logical operations on non-native types such as 32-bit integer and 16-bit floating-point. By using JIT vector assembling and by encoding the vector length of Connex-S as a parameter in the generated OPINCAA program, we achieve vector-length agnosticism, which supports execution on distinct embedded devices, such as several digital cameras with different resolutions, each equipped with a custom-width Connex-S accelerator meant to save energy in the image processing kernels. Since Connex-S has a limited-capacity local scratchpad memory, normally 256 KB, we also present how we use the PPCG C-to-C code generator to perform data tiling that minimizes the total kernel execution time, subject to fitting larger program data in the local memory. We devise an accurate cost model for the Connex-S accelerator to choose the best-performing tile sizes at compile time. We successfully compile several simple benchmarks frequently used, for example, in high-performance and computer vision embedded applications. We report speedup factors of up to 11.33 when running them on a Connex-S accelerator with 128 16-bit integer lanes, relative to the dual-core ARM Cortex A9 host, which is clocked at a frequency 6.67 times higher and has a total of two 128-bit Neon SIMD units.
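
The idea of vector-length agnosticism described in this abstract can be illustrated with a minimal sketch. It is not the OPINCAA API or the authors' generated code: the names `wrap16` and `vla_add_i16` are hypothetical, the lanes are modeled as plain Python integers, and two's-complement wraparound on 16-bit lanes is an assumption. The point is only that the same kernel, with the vector length passed as a parameter, gives identical results on accelerators of different widths while the number of vector steps changes.

```python
def wrap16(value):
    """Wrap an integer to the signed 16-bit range (assumed lane behaviour)."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def vla_add_i16(a, b, vector_length):
    """Add two equal-length lists lane by lane, `vector_length` lanes per step.

    Returns the result list and the number of vector steps that were needed.
    The vector length is a run-time parameter, not a compile-time constant.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    result = []
    steps = 0
    for start in range(0, len(a), vector_length):
        chunk_a = a[start:start + vector_length]
        chunk_b = b[start:start + vector_length]
        # One simulated vector instruction: every lane adds independently.
        result.extend(wrap16(x + y) for x, y in zip(chunk_a, chunk_b))
        steps += 1
    return result, steps


if __name__ == "__main__":
    a = list(range(-100, 200))
    b = [30000] * len(a)
    for width in (32, 128, 4096):
        out, steps = vla_add_i16(a, b, width)
        print(width, steps, out[:3])
```

Under these assumptions, a wider accelerator finishes the same 300-element addition in fewer steps, and the caller never has to be rewritten for a particular lane count.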

In many cases, applications are not optimized for the hardware on which they run. Several reasons contribute to this unsatisfactory situation, such as legacy code, commercial code distributed in binary form, or deployment on compute farms. In fact, backward compatibility of the ISA guarantees only functionality, not the best exploitation of the hardware. In this work, we focus on maximizing CPU efficiency for the SIMD extensions. The first contribution was originally published in the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS XV, July 2015, Agios Konstantinos, Greece. It is a binary-to-binary optimization framework in which loops vectorized for an older version of the processor's SIMD extension are automatically converted to a newer one. It is a lightweight mechanism that does not include a vectorizer, but instead leverages what a static vectorizer previously did. We show that many loops compiled for x86 SSE can be dynamically converted to the more recent and more powerful AVX, and how correctness is maintained with regard to challenges such as data dependencies and reductions. We obtain speedups in line with those of a native compiler targeting AVX. The second contribution is the runtime vectorization of loops in binary codes that were not originally vectorized. For this purpose, we use open source frameworks that we have tuned and integrated to (1) dynamically lift the x86 binary into the Intermediate Representation form of the LLVM compiler, (2) abstract hot loops in the polyhedral model, (3) use the power of this mathematical framework to vectorize them, and (4) finally compile them back into executable form using the LLVM Just-In-Time compiler. In most cases, the obtained speedups are close to the number of elements that can be simultaneously processed by the SIMD unit. The re-vectorizer and auto-vectorizer are implemented inside a dynamic optimization platform that is completely transparent to the user, does not require any rewriting of the binaries, and operates during program execution.
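
The reduction challenge mentioned in this abstract can be sketched in a few lines. This is an illustrative model, not the authors' binary translator: the function `vector_sum` is hypothetical, and SIMD registers are modeled as Python lists. With 4 lanes it stands in for a loop vectorized for 128-bit SSE on 32-bit elements, and with 8 lanes for the same loop widened to 256-bit AVX. Widening changes the number of partial sums, the trip count, and the size of the scalar epilogue, all of which a re-vectorizer would have to adjust consistently.

```python
def vector_sum(values, lanes):
    """Sum `values` the way a vectorized reduction loop would.

    Keeps one partial sum per lane, combines the partial sums at the end
    (horizontal reduction), and handles leftover elements in a scalar epilogue.
    Returns the total and the number of vector iterations executed.
    """
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    partial = [0] * lanes
    full = len(values) - len(values) % lanes  # elements covered by full vectors
    vector_iterations = 0
    for start in range(0, full, lanes):
        for lane in range(lanes):
            partial[lane] += values[start + lane]
        vector_iterations += 1
    total = sum(partial)          # horizontal reduction across lanes
    for value in values[full:]:   # scalar epilogue for the remainder
        total += value
    return total, vector_iterations


if __name__ == "__main__":
    data = list(range(1, 101))
    print(vector_sum(data, 4))  # 4-lane version (SSE-like width)
    print(vector_sum(data, 8))  # 8-lane version (AVX-like width)
```

For integers both widths give the same total. For floating-point data the wider version reassociates the additions differently, so the rounding of the result may differ slightly, which is one reason reductions need special care when a loop is widened.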

Articles published on SIMD Units

Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores

Graph-Waving architecture: Efficient execution of graph applications on GPUs

A Vector-Length Agnostic Compiler for the Connex-S Accelerator with Scratchpad Memory

Development of element-by-element kernel algorithms in unstructured finite-element solvers for many-core wide-SIMD CPUs: Application to earthquake simulation

Poker

Optimizations of Unstructured Aerodynamics Computations for Many-core Architectures

An efficient SIMD compression format for sparse matrix‐vector multiplication

Efficient Parallel Random Sampling—Vectorized, Cache-Efficient, and Online

Performance Optimization Strategies for WRF Physics Schemes Used in Weather Modeling

DITVA: Dynamic Inter-Thread Vectorization Architecture

Improving the Efficiency of GPGPU Work-Queue Through Data Awareness

Scalpel

A hybrid algorithm for parallel molecular dynamics simulations

Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

Runtime Vectorization Transformations of Binary Code

Hierarchical parallelisation of functional renormalisation group calculations — hp-fRG

Accelerating multi-channel filtering of audio signal on ARM processors

Can traditional programming bridge the ninja performance gap for parallel computing applications?

A methodology for speeding up matrix vector multiplication for single/multi-core architectures

Coordinate multi-core DSP YHFT-QMBase: architecture and implementation
