Prefix Sum Research Articles

A reduction (an accumulation over a set of values using an associative and commutative operator) is a common operation in many numerical computations, including scientific computing, machine learning, computer vision, and financial analytics. Contemporary polyhedral-based compilation techniques make it possible to optimize reductions, such as prefix sums, in which each component of the reduction's output potentially shares computation with another component. An optimizing compiler can therefore identify the computation shared between multiple components and generate code that computes it only once. These techniques, however, do not support reductions that, when phrased in the language of the polyhedral model, span multiple dependent statements. In such cases, existing approaches can generate incorrect code that violates the data dependences of the original, unoptimized program. In this work, we identify and formalize the optimization of dependent reductions as an integer bilinear program. We present a heuristic optimization algorithm that uses an affine sequential schedule of the program to determine how to simplify reductions yet still preserve the program's dependences. We demonstrate that the algorithm provides optimal complexity for a set of benchmark programs from the literature on probabilistic inference algorithms, whose performance critically relies on simplifying these reductions. The complexities for 10 of the 11 programs improve significantly, by factors of at least the size of the input data, which is in the range of 10^4 to 10^6 for typical real application inputs. We also confirm the significance of the improvement by showing speedups in wall-clock time that range from 1.1x to over 10^6x.
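The idea of shared computation between output components can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the function names `prefix_sums_naive` and `prefix_sums_simplified` are hypothetical, and the example only shows the simplest case, a plain prefix sum, where each output can reuse the previous one.

```python
from typing import List


def prefix_sums_naive(xs: List[float]) -> List[float]:
    """Treat every output as an independent reduction.

    out[i] = xs[0] + ... + xs[i] is recomputed from scratch for each i,
    so the total work is O(n^2) additions.
    """
    return [sum(xs[: i + 1]) for i in range(len(xs))]


def prefix_sums_simplified(xs: List[float]) -> List[float]:
    """Reuse the computation shared with the previous output.

    out[i] = out[i - 1] + xs[i], so the total work is O(n) additions.
    """
    out: List[float] = []
    running = 0
    for x in xs:
        running += x          # the shared partial sum is computed only once
        out.append(running)
    return out
```

Both functions return the same values; only the amount of repeated work differs, which is the kind of complexity improvement the abstract refers to.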

In this paper we propose a high-level approach to developing GPU applications based on the Vulkan API. The goal of the work is to reduce the complexity of developing and debugging applications that implement complex algorithms on the GPU using Vulkan. The approach relies on code generation: a C++ program is translated into an optimized Vulkan implementation, which includes automatic shader generation, resource binding, and the use of synchronization mechanisms (Vulkan barriers). The proposed solution is not a general-purpose programming technology; it specializes in specific tasks. At the same time it is extensible, which allows the solution to be adapted to new problems. For a single input C++ program, we can generate several implementations for different cases (via translator options) or for different hardware. For example, a call to virtual functions can be implemented through a switch construct in a kernel, through sorting threads and indirect dispatch of different kernels, or through the so-called callable shaders in Vulkan. Instead of creating a universal programming technology for building various software systems, we offer an extensible technology that can be customized for a specific class of applications. Unlike, for example, Halide, we do not use a domain-specific language; the necessary knowledge is extracted from ordinary C++ code. We therefore do not extend the language with any new constructs or directives, and the input is assumed to be normal C++ source code (albeit with some restrictions) that can be compiled by any C++ compiler. We use pattern matching to find specific patterns in C++ code and convert them to GPU-efficient code using Vulkan. Patterns are expressed through classes, member functions, and the relationships between them.
Thus, the proposed technology provides a cross-platform solution by generating different implementations of the same algorithm for different GPUs. This also gives access to the specific hardware functionality required in computer graphics applications. Patterns are divided into architectural and algorithmic. An architectural pattern defines the domain and the behavior of the translator as a whole (for example, image processing, ray tracing, neural networks, computational fluid dynamics, etc.). Algorithmic patterns express knowledge of data flow and control and define a narrower class of algorithms that can be efficiently implemented in hardware. Algorithmic patterns can occur within architectural patterns; examples include parallel reduction, compaction (parallel append), sorting, prefix sum, histogram calculation, and map-reduce. The proposed generator works on the principle of code morphing: given a certain class in the program and a set of transformation rules, one can automatically generate another class with the desired properties (for example, an implementation of the algorithm on the GPU). The generated class inherits from the input class and thus has access to all data and functions of the input class. Overriding virtual functions in the generated class helps the user connect the generated code to other hand-written Vulkan code. Shaders can be generated in two variants: OpenCL shaders for Google's clspv compiler and GLSL shaders for an arbitrary GLSL compiler. The clspv variant is better for code that uses pointers intensively, and the GLSL generator is better if specific hardware features are used (such as hardware ray tracing acceleration). We have demonstrated our technology on several examples related to image processing and ray tracing, on which we obtain a 30-100 times speedup over a multithreaded CPU implementation.
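Two of the algorithmic patterns named above, prefix sum and compaction (parallel append), are closely related, and a small sequential Python sketch may make the connection concrete. It is only an illustration under stated assumptions: the names `exclusive_scan` and `compact` are hypothetical, the code runs on the CPU, and it does not reproduce the generator or the Vulkan code described in the abstract.

```python
from itertools import accumulate
from typing import Callable, List, TypeVar

T = TypeVar("T")


def exclusive_scan(flags: List[int]) -> List[int]:
    """Exclusive prefix sum: out[i] is the sum of flags[0..i-1]."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def compact(items: List[T], keep: Callable[[T], bool]) -> List[T]:
    """Keep the items that satisfy `keep`, preserving their order.

    The exclusive scan of the 0/1 flags gives every kept item its output
    slot, so each write below is independent of the others. On a GPU each
    loop iteration could be handled by its own thread.
    """
    flags = [1 if keep(x) else 0 for x in items]
    offsets = exclusive_scan(flags)
    total = (offsets[-1] + flags[-1]) if items else 0
    out: List[T] = [None] * total  # type: ignore[list-item]
    for i, x in enumerate(items):
        if flags[i]:
            out[offsets[i]] = x
    return out
```

For instance, `compact([5, 2, 8, 3, 9], lambda v: v > 4)` computes the flags `[1, 0, 1, 0, 1]`, the offsets `[0, 1, 1, 2, 2]`, and writes the three kept values into slots 0, 1, and 2.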

Articles published on Prefix Sum

Multi-Function Scan Circuit for Assisting the Parallel Computational Map Pattern

Totally-ordered Sequential Rules for Utility Maximization

Smoother: on-the-fly processing of interactome data using prefix sums.

A portable C++ library for memory and compute abstraction on multi‐core CPUs and GPUs

When does 0–1 Principle Hold for Prefix Sums?

Elementary Algorithms – Prefix Sum

An Improved Parallel Prefix Sums Algorithm

Formal verification of parallel prefix sum and stream compaction algorithms in CUDA

Parallel Makespan Calculation for Flow Shop Scheduling Problem with Minimal and Maximal Idle Time

Parameterized Splitting of Summed Volume Tables

Simplifying dependent reductions in the polyhedral model

Parallel Differential Evolutionary Particle Filtering Algorithm Based on the CUDA Unfolding Cycle

Automating Vulkan development: a domain-specific approach

Lambda calculus with algebraic simplification for reduction parallelisation: Extended study

Optimized Parallel Prefix Sum Algorithm on Optoelectronic Biswapped-Torus Architecture

Optimal Parallel Prefix Sum Computation on Three-Dimensional Torus Network

A completely parallel surface reconstruction method for particle-based fluids

Compact Fenwick trees for dynamic ranking and selection

Differentially Private Real-time Streaming Data Publication Based on Sliding Window under Exponential Decay

Sparse prefix sums: Constant-time range sum queries over sparse multidimensional data cubes
