Abstract

We present a new algorithm to quickly generate high-performance GPU implementations of complex imaging and vision pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or hand-optimized kernels. We address the scalability challenge of extending search-based automatic scheduling to map large real-world programs to the deep hierarchies of memory and parallelism on GPU architectures in reasonable compile time. We achieve this using (1) a two-phase search algorithm that first ‘freezes’ decisions for the lowest cost sections of a program, allowing relatively more time to be spent on the important stages, (2) a hierarchical sampling strategy that groups schedules based on their structural similarity, then samples representatives to be evaluated, allowing us to explore a large space with few samples, and (3) memoization of repeated partial schedules, amortizing their cost over all their occurrences. We guide the process with an efficient cost model combining machine learning, program analysis, and GPU architecture knowledge. We evaluate our method’s performance on a diverse suite of real-world imaging and vision pipelines. Our scalability optimizations lead to average compile time speedups of 49x (up to 530x). We find schedules that are on average 1.7x faster than existing automatic solutions (up to 5x), and competitive with what the best human experts were able to achieve in an active effort to beat our automatic results.
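To make ingredient (3) and the beam search it accelerates more concrete, the following is a self-contained C++ toy of cost-model-guided beam search with memoization of repeated partial schedules. It is our own simplification, not the paper's implementation: the three-way decision alphabet, the canonical-key function, and predicted_cost are hypothetical stand-ins for the real schedule space and the learned cost model.

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Canonical key: structurally identical partial schedules (here, the
    // same multiset of decisions) share one cost-model query via the memo.
    static std::string canonical(std::string decisions) {
        std::sort(decisions.begin(), decisions.end());
        return decisions;
    }

    // Stand-in for the learned cost model: cheap and deterministic.
    static double predicted_cost(const std::string &key) {
        double c = 0.0;
        for (size_t i = 0; i < key.size(); i++)
            c += (key[i] - 'a' + 1) * 31.0 / double(i + 1);
        return c;
    }

    struct State {
        std::string decisions;  // encoding of the partial schedule so far
        double cost = 0.0;      // memoized cost-model prediction
    };

    static State beam_search(int num_stages, size_t beam_width) {
        std::unordered_map<std::string, double> memo;  // partial-schedule cache
        std::vector<State> beam{State{}};
        for (int stage = 0; stage < num_stages; stage++) {
            std::vector<State> next;
            for (const State &s : beam) {
                for (char d : {'a', 'b', 'c'}) {  // candidate decisions per Func
                    State child{s.decisions + d, 0.0};
                    const std::string key = canonical(child.decisions);
                    auto it = memo.find(key);
                    child.cost = (it != memo.end())
                                     ? it->second
                                     : (memo[key] = predicted_cost(key));
                    next.push_back(std::move(child));
                }
            }
            // Keep only the beam_width cheapest partial schedules.
            std::sort(next.begin(), next.end(),
                      [](const State &a, const State &b) { return a.cost < b.cost; });
            if (next.size() > beam_width) next.resize(beam_width);
            beam = std::move(next);
        }
        return beam.front();  // cheapest complete schedule found
    }

    int main() {
        State best = beam_search(/*num_stages=*/4, /*beam_width=*/8);
        std::printf("best: %s (predicted cost %.2f)\n",
                    best.decisions.c_str(), best.cost);
        return 0;
    }

Because equivalent partial schedules map to one canonical key, each is priced once and the result is amortized over all of its occurrences in the beam, loosely mirroring the memoization strategy described above.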

Highlights

  • There is an increasing demand for high-performance imaging and vision algorithms, but implementing these programs on GPUs involves making optimization choices from a large space of options.

  • The beam search operates in two phases for each Func. To help illustrate this process, we introduce a Halide pipeline based on our previous stencil chain example:

        Func intermed, output;
        intermed(x, y) = input(x-1, y) + input(x, y) + input(x+1, y);
        output(x, y) = intermed(x-1, y) + intermed(x, y) + intermed(x+1, y);

    Halide represents this algorithm as a directed acyclic graph of Funcs, where output is a consumer of producer intermed, which in turn is a consumer of input. (One candidate GPU schedule for this pipeline is sketched after this list.)

  • We evaluate our autoscheduler on a diverse set of 17 imaging and vision programs (Fig. 6), including 15 applications from the Halide repository: bilateral grid, local laplacian, non-local means, lens blur, camera pipe, a 32-stage stencil chain, Harris corner detection, histogram equalize, max filter, unsharp mask, interpolate, a neural network conv layer with ReLU activation, SGEMM (single-precision general matrix multiply), an IIR blur, and BGU.
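As a concrete illustration of the schedule space, below is a minimal sketch, in the Halide C++ API, of one candidate GPU schedule for the two-stage stencil pipeline above. It is not the autoscheduler's output: the tile sizes, the compute_at level, and names like xo/yo/xi/yi are illustrative choices among the many the search explores automatically.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        ImageParam input(Float(32), 2, "input");
        Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

        Func intermed("intermed"), output("output");
        intermed(x, y) = input(x - 1, y) + input(x, y) + input(x + 1, y);
        output(x, y) = intermed(x - 1, y) + intermed(x, y) + intermed(x + 1, y);

        // One candidate point in the schedule space: tile output into GPU
        // blocks and threads, and stage intermed per block in shared memory.
        output.gpu_tile(x, y, xo, yo, xi, yi, 32, 8);
        intermed.compute_at(output, xo)
                .gpu_threads(x, y)
                .store_in(MemoryType::GPUShared);

        output.compile_jit(get_host_target().with_feature(Target::CUDA));
        return 0;
    }

Every such choice (tile extents, the loop level at which to fuse intermed into its consumer, whether to stage it in shared memory) is one coordinate in the space the two-phase beam search navigates.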


Summary

INTRODUCTION

There is an increasing demand for high-performance imaging and vision algorithms, but implementing these programs on GPUs involves making optimization choices from a large space of options. To balance generality and computational cost, our cost model combines program analysis and machine learning: we extract program features that capture the architectural intricacies required to predict the performance of GPU programs, and provide these features as input to a lightweight neural network that predicts performance. It evaluates tens of thousands of schedules per second, versus the seconds or minutes needed to compile and benchmark a single one.

We contribute a new automatic scheduling algorithm that scales orders of magnitude better than prior work, making it possible to efficiently explore a large, rich space of GPU schedules. It delivers state-of-the-art performance on a suite of real-world imaging and vision pipelines, with a geomean speedup of 1.7× (up to 5×) over the prior state-of-the-art GPU autoscheduler [Sioutas et al. 2020], and is competitive (0.95×) with what the best human experts were able to achieve in an active effort to beat our automatic results.

Fusion introduces additional choices for the level in the loop nest at which to fuse each stage, as well as additional tiling options within those fused blocks, further exacerbating the scalability problem.
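As a rough illustration of the cost model's interface, here is a minimal C++ sketch of "features in, predicted cost out": a fixed-length feature vector drives one small ReLU hidden layer to produce a scalar prediction. The feature names, layer sizes, and zero-initialized weights are hypothetical; the real network's featurization comes from program analysis and its weights from training on benchmarked schedules.

    #include <algorithm>
    #include <array>
    #include <cstdio>

    constexpr int kFeatures = 4;  // e.g., thread count, shared-memory bytes,
                                  // global loads, arithmetic ops (hypothetical)
    constexpr int kHidden = 8;

    struct CostModel {
        // Weights would come from training on benchmarked schedules;
        // zero-initialized here, so the output only illustrates the shape.
        std::array<std::array<float, kFeatures>, kHidden> w1{};
        std::array<float, kHidden> b1{}, w2{};
        float b2 = 0.f;

        float predict(const std::array<float, kFeatures> &f) const {
            float out = b2;
            for (int h = 0; h < kHidden; h++) {
                float a = b1[h];
                for (int i = 0; i < kFeatures; i++) a += w1[h][i] * f[i];
                out += w2[h] * std::max(a, 0.f);  // ReLU hidden unit
            }
            return out;  // predicted cost, obtained in microseconds rather
                         // than the seconds needed to compile and benchmark
        }
    };

    int main() {
        CostModel model;  // untrained: illustrates the interface only
        std::printf("predicted cost: %f\n",
                    model.predict({256.f, 4096.f, 3.f, 5.f}));
        return 0;
    }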

The Cost of Evaluating an Option
Limitations of Graph Partitioning
OVERVIEW OF THE AUTOSCHEDULER
Hierarchically Sampling the Search Space
Freezing Low Cost Stages
Memoization of Partial Schedules
OUR SEARCH ALGORITHM
Choosing Serial Loops
Choosing Thread Loops
Block Loops
Hierarchical Sampling
Avoiding Known Bad States
Pruning
Lowering Optimizations
EVALUATING SCHEDULES
Features
Cost Model
Training Procedure
RESULTS
Post-Compile Filtering
Analysis
Cost Model Evaluation
Manual Schedules Outside the Search Space
RELATED WORK
LIMITATIONS & FUTURE WORK
CONCLUSION
A FEATURIZATION
B COST MODEL COMPONENTS