Profiling Heterogeneous Computing Performance with VTune Profiler

Vladimir Tsymbal, Alexandr Kurylev

doi:10.1145/3456669.3456678

Abstract

Programming heterogeneous platforms requires a deep understanding of the system architecture at all levels, which helps application designers find the best data and work decomposition between the CPU and accelerating hardware such as GPUs. In many cases, however, applications are being converted from a conventional CPU programming language such as C++, or from accelerator-friendly but still low-level languages such as OpenCL, and the main problem is to determine which parts of the application benefit from being offloaded to the GPU. Another problem is to estimate how much performance one might gain from acceleration on a particular GPGPU device. Each platform has unique limitations that affect the performance of offloaded computing tasks, e.g. data transfer tax, task initialization overhead, and memory latency and bandwidth limitations. To take those constraints into account, software developers need tooling that collects the right information and produces recommendations for making the best design and optimization decisions. In this presentation we introduce two new GPU performance analysis types in Intel® VTune™ Profiler, and a methodology for profiling the performance of heterogeneous applications that these analyses support. VTune Profiler is a well-known tool for performance characterization on CPUs; it now includes GPU Offload Analysis and GPU Hotspots Analysis for applications written in most offloading models, including OpenCL, SYCL/Data Parallel C++, and OpenMP Offload. The GPU Offload analysis helps identify how the CPU interacts with the GPU(s) by creating and submitting tasks to offload queues. It provides metrics and performance data such as GPU Utilization, Hottest GPU Computing Tasks, Task instance count and timing, kernel Data Transfer Size, SIMD Width measurements, GPU Execution Unit (EU) thread occupancy, and Memory Utilization.
Taken together, these metrics give a systematic picture of how effectively tasks were offloaded and executed on GPUs. The GPU Hotspots analysis examines the efficiency of computing tasks, or kernels, as they run on GPU EUs and interact with the GPU memory subsystem. Inefficiencies caused by the compute kernel implementation or by compiler issues lead to idle EUs or to increased latency when fetching data from memory sources into EU registers, which eventually degrades performance. The complexity of the GPU memory subsystem (L1 and L2 caches, Shared Local Memory, L3 cache, GPU DRAM, CPU LLC and DRAM) makes analyzing data access inefficiencies even more difficult. The GPU Hotspots analysis addresses these problems by presenting a visualization of the current GPU Memory Hierarchy Diagram, detailed tracing of data transfers between memory agents, memory bandwidth measurements, and barriers and atomics analysis. In addition, VTune analyzes each compute kernel at the source level, attributing performance metrics to source lines or assembly instructions. Memory Latency metrics help determine the most inefficient data accesses at the source-line level. A supplementary GPU Instruction Count analysis clarifies the set of instructions the compiler generated for a kernel. The GPU analyses in VTune are well developed for the OpenCL language and runtime; the more recent SYCL language and its extension Data Parallel C++, along with the Level Zero runtime, are supported as well, running on all Intel GPUs from Gen9 HD Graphics to Intel Iris Xe Graphics (a discrete GPU card). Results of performance profiling on different GPU architectures will be presented in the session. VTune Profiler for GPUs is a newly extended toolset that is being actively developed alongside new acceleration architectures at Intel.
New features and analysis concepts are continually added to the tool to meet the needs of software architects and developers.
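
The offload trade-off described in the abstract (kernel speedup weighed against data transfer tax and task initialization overhead) can be illustrated with a back-of-the-envelope model. The sketch below is not VTune's method or any part of its API. The class name, its fields, and the simple additive cost model are all assumptions made for illustration, and the numbers in the usage section are hypothetical rather than measured.

```python
from dataclasses import dataclass


@dataclass
class OffloadEstimate:
    """Hypothetical additive cost model for offloading one code region."""

    cpu_time_s: float          # time the region takes on the CPU
    kernel_speedup: float      # assumed raw GPU-vs-CPU speedup of the kernel alone
    bytes_moved: float         # total bytes copied to and from the device
    link_bandwidth_bps: float  # assumed sustained host-device bandwidth, bytes/s
    init_overhead_s: float     # assumed task creation and submission cost

    def gpu_time_s(self) -> float:
        # Offloaded time = initialization + data transfer + accelerated compute.
        transfer = self.bytes_moved / self.link_bandwidth_bps
        compute = self.cpu_time_s / self.kernel_speedup
        return self.init_overhead_s + transfer + compute

    def net_speedup(self) -> float:
        # Values below 1.0 mean the offload is slower than staying on the CPU.
        return self.cpu_time_s / self.gpu_time_s()


if __name__ == "__main__":
    # Hypothetical numbers: a long-running region and a short one moving the same data.
    long_region = OffloadEstimate(1.0, 10.0, 1e9, 1e10, 0.0)
    short_region = OffloadEstimate(0.001, 10.0, 1e9, 1e10, 0.0)
    print("long region net speedup:", long_region.net_speedup())
    print("short region net speedup:", short_region.net_speedup())
```

Under this model, a region with little compute per byte moved can lose from offloading even when the kernel itself runs much faster on the GPU, which is why measured transfer sizes and task timings matter when deciding what to offload.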

Full Text