Abstract

We present a new algorithm to quickly generate high-performance GPU implementations of complex imaging and vision pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or hand-optimized kernels. We address the scalability challenge of extending search-based automatic scheduling to map large real-world programs to the deep hierarchies of memory and parallelism on GPU architectures in reasonable compile time. We achieve this using (1) a two-phase search algorithm that first ‘freezes’ decisions for the lowest cost sections of a program, allowing relatively more time to be spent on the important stages, (2) a hierarchical sampling strategy that groups schedules based on their structural similarity, then samples representatives to be evaluated, allowing us to explore a large space with few samples, and (3) memoization of repeated partial schedules, amortizing their cost over all their occurrences. We guide the process with an efficient cost model combining machine learning, program analysis, and GPU architecture knowledge. We evaluate our method’s performance on a diverse suite of real-world imaging and vision pipelines. Our scalability optimizations lead to average compile time speedups of 49x (up to 530x). We find schedules that are on average 1.7x faster than existing automatic solutions (up to 5x), and competitive with what the best human experts were able to achieve in an active effort to beat our automatic results.
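To make ingredient (3) and the beam search it accelerates more concrete, the following is a self-contained C++ toy of cost-model-guided beam search with memoization of repeated partial schedules. It is our own simplification, not the paper's implementation: the three-way decision alphabet, the canonical-key function, and predicted_cost are hypothetical stand-ins for the real schedule space and the learned cost model.

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Canonical key: structurally identical partial schedules (here, the
    // same multiset of decisions) share one cost-model query via the memo.
    static std::string canonical(std::string decisions) {
        std::sort(decisions.begin(), decisions.end());
        return decisions;
    }

    // Stand-in for the learned cost model: cheap and deterministic.
    static double predicted_cost(const std::string &key) {
        double c = 0.0;
        for (size_t i = 0; i < key.size(); i++)
            c += (key[i] - 'a' + 1) * 31.0 / double(i + 1);
        return c;
    }

    struct State {
        std::string decisions;  // encoding of the partial schedule so far
        double cost = 0.0;      // memoized cost-model prediction
    };

    static State beam_search(int num_stages, size_t beam_width) {
        std::unordered_map<std::string, double> memo;  // partial-schedule cache
        std::vector<State> beam{State{}};
        for (int stage = 0; stage < num_stages; stage++) {
            std::vector<State> next;
            for (const State &s : beam) {
                for (char d : {'a', 'b', 'c'}) {  // candidate decisions per Func
                    State child{s.decisions + d, 0.0};
                    const std::string key = canonical(child.decisions);
                    auto it = memo.find(key);
                    child.cost = (it != memo.end())
                                     ? it->second
                                     : (memo[key] = predicted_cost(key));
                    next.push_back(std::move(child));
                }
            }
            // Keep only the beam_width cheapest partial schedules.
            std::sort(next.begin(), next.end(),
                      [](const State &a, const State &b) { return a.cost < b.cost; });
            if (next.size() > beam_width) next.resize(beam_width);
            beam = std::move(next);
        }
        return beam.front();  // cheapest complete schedule found
    }

    int main() {
        State best = beam_search(/*num_stages=*/4, /*beam_width=*/8);
        std::printf("best: %s (predicted cost %.2f)\n",
                    best.decisions.c_str(), best.cost);
        return 0;
    }

Because equivalent partial schedules map to one canonical key, each is priced once and the result is amortized over all of its occurrences in the beam, loosely mirroring the memoization strategy described above.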

Highlights

  • There is an increasing demand for high-performance imaging and vision algorithms, but implementing these programs on GPUs involves making optimization choices from a large space of options.

  • The beam search operates in two phases for each Func. To help illustrate this process, we introduce a Halide pipeline based on our previous stencil chain example:

        Func intermed, output;
        intermed(x, y) = input(x-1, y) + input(x, y) + input(x+1, y);
        output(x, y) = intermed(x-1, y) + intermed(x, y) + intermed(x+1, y);

    Halide represents this algorithm as a directed acyclic graph of Funcs, where output is a consumer of producer intermed, which in turn is a consumer of input. (One candidate GPU schedule for this pipeline is sketched after this list.)

  • We evaluate our autoscheduler on a diverse set of 17 imaging and vision programs (Fig. 6), including 15 applications from the Halide repository: bilateral grid, local laplacian, non-local means, lens blur, camera pipe, a 32-stage stencil chain, Harris corner detection, histogram equalize, max filter, unsharp mask, interpolate, a neural network conv layer with ReLU activation, SGEMM (single-precision general matrix multiply), an IIR blur, and BGU.
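As a concrete illustration of the schedule space, below is a minimal sketch, in the Halide C++ API, of one candidate GPU schedule for the two-stage stencil pipeline above. It is not the autoscheduler's output: the tile sizes, the compute_at level, and names like xo/yo/xi/yi are illustrative choices among the many the search explores automatically.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        ImageParam input(Float(32), 2, "input");
        Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

        Func intermed("intermed"), output("output");
        intermed(x, y) = input(x - 1, y) + input(x, y) + input(x + 1, y);
        output(x, y) = intermed(x - 1, y) + intermed(x, y) + intermed(x + 1, y);

        // One candidate point in the schedule space: tile output into GPU
        // blocks and threads, and stage intermed per block in shared memory.
        output.gpu_tile(x, y, xo, yo, xi, yi, 32, 8);
        intermed.compute_at(output, xo)
                .gpu_threads(x, y)
                .store_in(MemoryType::GPUShared);

        output.compile_jit(get_host_target().with_feature(Target::CUDA));
        return 0;
    }

Every such choice (tile extents, the loop level at which to fuse intermed into its consumer, whether to stage it in shared memory) is one coordinate in the space the two-phase beam search navigates.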


Summary

INTRODUCTION

There is an increasing demand for high-performance imaging and vision algorithms, but implementing these programs on GPUs involves making optimization choices from a large space of options. To balance generality and computational cost, our cost model combines program analysis and machine learning: we extract program features that capture the architectural intricacies required to predict the performance of GPU programs, and provide these features as input to a lightweight neural network that predicts performance. It evaluates tens of thousands of schedules per second, versus the seconds or minutes needed to compile and benchmark a single one.

We contribute a new automatic scheduling algorithm that scales orders of magnitude better than prior work, making it possible to efficiently explore a large, rich space of GPU schedules. It delivers state-of-the-art performance on a suite of real-world imaging and vision pipelines, with a geomean speedup of 1.7× (up to 5×) over the prior state-of-the-art GPU autoscheduler [Sioutas et al. 2020], and is competitive (0.95×) with what the best human experts were able to achieve in an active effort to beat our automatic results.

Fusion introduces additional choices for the level in the loop nest at which to fuse each stage, as well as additional tiling options within those fused blocks, further exacerbating the scalability problem.
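As a rough illustration of the cost model's interface, here is a minimal C++ sketch of "features in, predicted cost out": a fixed-length feature vector drives one small ReLU hidden layer to produce a scalar prediction. The feature names, layer sizes, and zero-initialized weights are hypothetical; the real network's featurization comes from program analysis and its weights from training on benchmarked schedules.

    #include <algorithm>
    #include <array>
    #include <cstdio>

    constexpr int kFeatures = 4;  // e.g., thread count, shared-memory bytes,
                                  // global loads, arithmetic ops (hypothetical)
    constexpr int kHidden = 8;

    struct CostModel {
        // Weights would come from training on benchmarked schedules;
        // zero-initialized here, so the output only illustrates the shape.
        std::array<std::array<float, kFeatures>, kHidden> w1{};
        std::array<float, kHidden> b1{}, w2{};
        float b2 = 0.f;

        float predict(const std::array<float, kFeatures> &f) const {
            float out = b2;
            for (int h = 0; h < kHidden; h++) {
                float a = b1[h];
                for (int i = 0; i < kFeatures; i++) a += w1[h][i] * f[i];
                out += w2[h] * std::max(a, 0.f);  // ReLU hidden unit
            }
            return out;  // predicted cost, obtained in microseconds rather
                         // than the seconds needed to compile and benchmark
        }
    };

    int main() {
        CostModel model;  // untrained: illustrates the interface only
        std::printf("predicted cost: %f\n",
                    model.predict({256.f, 4096.f, 3.f, 5.f}));
        return 0;
    }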

The Cost of Evaluating an Option
Limitations of Graph Partitioning
OVERVIEW OF THE AUTOSCHEDULER
Hierarchically Sampling the Search Space
Freezing Low Cost Stages
Memoization of Partial Schedules
OUR SEARCH ALGORITHM
Choosing Serial Loops
Choosing Thread Loops
Block Loops
Hierarchical Sampling
Avoiding Known Bad States
Pruning
Lowering Optimizations
EVALUATING SCHEDULES
Features
Cost Model
Training Procedure
RESULTS
Post-Compile Filtering
Analysis
Cost Model Evaluation
Manual Schedules Outside the Search Space
RELATED WORK
LIMITATIONS & FUTURE WORK
CONCLUSION
A FEATURIZATION
B COST MODEL COMPONENTS