Closed-Loop Binary Optimization: Integrating De-Identified Production Telemetry into the Build Lifecycle
Modern performance optimization techniques mainly operate on the final binary emitted by the compiler. Profile-Guided Optimization (PGO) is one such approach: rather than relying on compile-time heuristics to select optimizations, PGO selects them based on run-time profiling of the program. Static compilation cannot predict dynamic control flow, and cache behavior depends on the workload running on production machines. By measuring execution in production, compilers can learn the frequency of hot paths and the demands placed on branch prediction, caches, and instruction scheduling. Instrumentation overhead is reduced by a load-test infrastructure that runs copies of production traffic. Privacy-sensitive user data is sanitized by de-identification pipelines that preserve query structure, so that optimizations in the data-management path remain possible. Continuous profiling keeps the optimizations effective over time as both execution environments and workloads change. Autotuning, the process of finding optimal compiler settings for a specific workload, is increasingly realized through machine learning techniques. Deployed as standard production-grade infrastructure, binary optimization offers economic value through better resource utilization and lower-latency services, and feeding real-world telemetry back into the compiler toolchain can create a virtuous circle of improvement for high-performance digital infrastructure.
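As a rough illustration of de-identification that preserves query structure, the sketch below replaces literal values in a SQL-like string with placeholders. The abstract does not describe the actual pipeline; the function name `deidentify_query` and the two regular expressions are hypothetical, and a production system would more likely use a real SQL parser.

```python
import re

# Hypothetical literal patterns, chosen for illustration only.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")      # quoted strings, '' as escape
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone integers/decimals


def deidentify_query(query: str) -> str:
    """Replace literal values with '?' while keeping the query shape."""
    # Strings are handled first so digits inside quoted values are not
    # processed a second time by the numeric pattern.
    without_strings = _STRING_LITERAL.sub("?", query)
    return _NUMBER_LITERAL.sub("?", without_strings)


raw = "SELECT name FROM users WHERE email = 'alice@example.com' AND age > 30"
print(deidentify_query(raw))
```

The tables, columns, and predicates survive, so a profile keyed on the sanitized string can still tell query shapes apart, while the user-supplied values do not.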
- Research Article
9
- 10.1109/12.485567
- Mar 1, 1996
- IEEE Transactions on Computers
This paper introduces a novel superscalar micro-architecture, called IAS-S, and its related software techniques. We treat two basic problems in superscalar machines. First, we seek a feasible hardware platform which allows the compiler to perform more aggressive instruction scheduling. Second, we develop a good way of communication between the instruction scheduler and register allocator to avoid inadequate register allocation resulting in poor instruction schedules. For the first part, IAS-S employs the Conjugate Register File (CRF) scheme to support multilevel instruction boosting so that a greater amount of instruction-level parallelism in a program can be identified at compile time. For the second part, the instruction scheduling in the IAS-S compiler consists of two passes, prepass and postpass, and a scheduling-conflict graph is built for the register allocator during the prepass scheduling. In this manner, the register allocator can take the potential benefit for later postpass instruction scheduling into account and thus prevents inadequate register allocation.
- Research Article
62
- 10.1145/1324969.1324973
- Dec 1, 2007
- ACM Transactions on Embedded Computing Systems
Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. In addition, caches are a source of unpredictability, resulting in programs sometimes behaving in a different way than expected. Detailed information about the number of cache misses and their causes allows us to predict cache behavior and to detect bottlenecks. Small modifications in the source code may change memory patterns, thereby altering the cache behavior. Code transformations, which take the cache behavior into account, might result in a high cache performance improvement. However, cache memory behavior is very hard to predict, thus making the task of optimizing and timing cache behavior very difficult. This article proposes and evaluates a new compiler framework that times cache behavior for multitasking systems. Our method explores the use of cache partitioning and dynamic cache locking to provide worst-case performance estimates in a safe and tight way for multitasking systems. We use cache partitioning, which divides the cache among tasks to eliminate intertask cache interferences. We combine static cache analysis and cache-locking mechanisms to ensure that all intratask conflicts, and consequently, memory access times, are exactly predictable. The results of our experiments demonstrate the capability of our framework to describe cache behavior at compile time. We compare our timing approach with a system equipped with a nonpartitioned, but statically, locked data cache. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, demonstrating that our fully predictable scheme does not compromise the performance of the transformed programs.
- Conference Article
2
- 10.1145/3372799.3397167
- Jun 16, 2020
Modern compilers are still built using technology that existed decades ago. These include basic algorithms and techniques for lexing, parsing, data-flow analysis, data dependence analysis, vectorization, register allocation, instruction selection, and instruction scheduling. It is high time that we modernize our compiler toolchain. In this talk, I will show the path to the modernization of one important compiler technique -- vectorization. Vectorization was first introduced in the era of Cray vector processors during the 1980's. In modernizing vectorization, I will first show how to use new techniques that better target modern hardware. While vector supercomputers need large vectors, which are only available by parallelizing loops, modern SIMD instructions efficiently work on short vectors. Thus, in 2000, we introduced Superword Level Parallelism (SLP) based vectorization. SLP finds short vector instructions within basic blocks, and by loop unrolling we can convert vector parallelism to SLP. Next, I will show how we can take advantage of the power of modern computers for compilation, by using more accurate but expensive techniques to improve SLP vectorization. Due to the hardware resource constraints of the era, like many other compiler optimizations, SLP implementation was a greedy algorithm. In 2018, we introduced goSLP, which uses integer linear programming to find an optimal instruction packing strategy and achieves 7.58% geomean performance improvement over the LLVM's SLP implementation on SPEC2017fp C/C++ programs. Finally, I will show how to truly modernize a compiler by automatically learning the necessary components of the compiler with Ithemal and Vemal. The optimality of goSLP is under LLVM's simple per instruction additive cost model that fits within the Integer programming framework. However, the actual cost of execution in a modern out-of-order, pipelined, superscalar processor is much more complex. 
Manually building such cost models as well as manually developing compiler optimizations is costly, tedious, error-prone and is hard to keep up with the architectural changes. Ithemal is the first learnt cost model for predicting the throughput of x86 basic blocks. It not only significantly outperforms (more than halves the error) state-of-the-art analytical hand-written tools like llvm-mca, but also is learnt from data requiring minimal human effort. Vemal is a learnt policy for end-to-end vectorization as opposed to tuning heuristics, which outperforms LLVM's SLP vectorizer. These data-driven techniques can help achieve state-of-the-art results while also reducing the development and maintenance burden of the compiler developer.
- Research Article
50
- 10.1016/s0898-1221(97)00184-3
- Nov 1, 1997
- Computers & Mathematics with Applications
Using integer linear programming for instruction scheduling and register allocation in multi-issue processors
- Conference Article
6
- 10.4230/oasics.wcet.2016.1
- May 2, 2017
- DROPS (Schloss Dagstuhl – Leibniz Center for Informatics)
Measurement-based timing analysis (MBTA) is often used to determine the timing behaviour of software programs embedded in safety-aware real-time systems deployed in various industrial domains including automotive and railway. MBTA methods rely on some form of instrumentation, either at hardware or software level, of the target program or fragments thereof to collect execution-time measurement data. A known drawback of software-level instrumentation is that instrumentation itself does affect the timing and functional behaviour of a program, resulting in the so-called probe effect: leaving the instrumentation code in the final executable can negatively affect average performance and could not be even admissible under stringent industrial qualification and certification standards; removing it before operation jeopardizes the results of timing analysis as the WCET estimates on the instrumented version of the program cannot be valid any more due, for example, to the timing effects incurred by different cache alignments. In this paper, we present a novel approach to mitigate the impact of instrumentation code on cache behaviour by reducing the instrumentation overhead while at the same time preserving and consolidating the results of timing analysis.
- Research Article
5
- 10.1145/3406536
- Oct 3, 2020
- ACM Transactions on Embedded Computing Systems
Compiling sequential C programs for Connex-S, a competitive, scalable and customizable, wide vector accelerator for intensive embedded applications with 32 to 4,096 16-bit integer lanes and a limited capacity local scratchpad memory, is challenging. Our compiler toolchain uses the LLVM framework and targets OPINCAA, a JIT vector assembler and coordination C++ library for Connex-S accelerating computations for an arbitrary CPU. Therefore, we address in the compiler middle end aspects of efficient vectorization, communication, and synchronization. We perform quantitative static analysis of the program useful, among others, for the symbolic-size compiler memory allocator and the coordination mechanism of OPINCAA. We also discuss the LLVM back end for the Connex-S processor and the methodology to automatically generate instruction selection code for emulating efficiently arithmetic and logical operations for non-native types such as 32-bit integer and 16-bit floating-point. By using JIT vector assembling and by encoding the vector length of Connex-S as a parameter in the generated OPINCAA program, we achieve vector-length agnosticism to support execution on distinct embedded devices, such as several digital cameras with different resolutions, each equipped with custom-width Connex-S accelerators meant to save energy for the image processing kernels. Since Connex-S has a limited capacity local scratchpad memory of 256 KB normally, we present how we also use the PPCG C-to-C code generator to perform data tiling to minimize the total kernel execution time, subject to fitting larger program data in the local memory. We devise an accurate cost model for the Connex-S accelerator to choose optimal performance tile sizes at compile time. We successfully compile several simple benchmarks frequently used, for example, in high-performance and computer vision embedded applications. 
We report speedup factors of up to 11.33 when running them on a Connex-S accelerator with 128 16-bit integer lanes w.r.t. the dual-core ARM Cortex A9 host clocked at a frequency 6.67 times higher, with a total of two 128-bit Neon SIMD units.
- Conference Article
4
- 10.1145/1244002.1244154
- Mar 11, 2007
For many embedded applications, program code size is a critical design factor because of its relationship with limited memory, energy and communication bandwidth. While pursuing better code redundancy elimination at compile time, people also began to focus on better encoding. Some RISC processors, such as ARM, MIPS and UniCore, support a 32-bit/16-bit dual-width instruction set. Mixed code generation is introduced in expectation of achieving both higher code density from the 16-bit instruction set and good performance from the 32-bit one, with little extra cost. We describe a new fine-grained mixed code generation scheme in this paper. We introduce into the 32-bit ISA a new 16-bit Mode-Changing instruction set which has the following features: firstly, the operations of the instructions are very common in UniCore32 programs and are appropriate to be coded into 16 bits; secondly, they can switch the current processor mode while performing their own operations. We implement the mixed code generation at link time in our compilation toolchain. Our experiments show that this scheme is successful in better encoding a program's computations to reduce code size without sacrificing performance. In addition, there are few modifications to the micro-architecture, ensuring good compatibility with the original instruction set architecture.
- Conference Article
8
- 10.1145/2016604.2016614
- May 3, 2011
Even parts of a program that are sequential or just inherently difficult to parallelize can be optimized for ILP. For instance, eliminating loop overheads and potential pipeline stalls from control flow can alleviate performance bottlenecks. Unfortunately, static compilation is limited in the extent to which it can identify opportunities to apply such optimizations. Generating code dynamically at run time, however, can create much more efficient applications by using information not available at compile time. We demonstrate our approach on a sparse-matrix PET scan code by aggressively unrolling loops and specializing code via dynamic code generation. We leverage task-level parallelism by having an auxiliary processor core concurrently generate code and feed it to the core executing the application. Our approach to fast code generation leverages patching and concatenating prepared code skeletons.
- Conference Article
1
- 10.5753/sscad.2024.244522
- Oct 23, 2024
Software-level approximations, such as loop perforation, function replacement, and memoization, can significantly enhance application performance and energy efficiency during compile time. However, approximating compilers often require extensive user intervention and lack the capability for real-time adaptation. This paper presents RAAS, a framework that integrates just-in-time recompilation with an automated evaluation system to create a general-purpose software approximation system with minimal user involvement. Our framework can apply input-aware approximations without needing a separate testing phase by continuously monitoring the target application and recompiling code blocks. We evaluated the framework with a set of resilient benchmarks while also comparing its performance with a similar framework focused on static compilation of approximations. Our findings demonstrate speedups of up to 6.3x with quality degradation limited to 30%, achieving competitive results to a static compilation with a shorter convergence time.
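Loop perforation, one of the approximations named above, can be sketched in a few lines: the loop runs only every `stride`-th iteration and the result is rescaled. This is a generic illustration, not the RAAS framework; the function `sum_of_squares`, the `stride` parameter, and the sample data are assumptions made for the example.

```python
def sum_of_squares(values, stride=1):
    """Sum of squares over a non-empty list; stride > 1 perforates the loop."""
    total = 0.0
    executed = 0
    for i in range(0, len(values), stride):  # skipped iterations are the approximation
        total += values[i] * values[i]
        executed += 1
    # Scale up to compensate for the iterations that were skipped.
    return total * len(values) / executed


data = [i / 100 for i in range(1000)]
exact = sum_of_squares(data)               # stride=1: every iteration runs
approx = sum_of_squares(data, stride=2)    # roughly half the work
print(f"relative error: {abs(approx - exact) / exact:.4%}")
```

For smooth inputs like this one the error stays small. An input-aware system would have to monitor quality at run time, because the same stride can be much less accurate on other data.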
- Conference Article
9
- 10.1109/cgo.2006.1
- Mar 26, 2006
Static compilers use profiling to predict run-time program behavior. Generally, this requires multiple input sets to capture wide variations in run-time behavior. This is expensive in terms of resources and compilation time. We introduce a new mechanism, 2D-profiling, which profiles with only one input set and predicts whether the result of the profile would change significantly across multiple input sets. We use 2D-profiling to predict whether a branch's prediction accuracy varies across input sets. The key insight is that if the prediction accuracy of an individual branch varies significantly over a profiling run with one input set, then it is more likely that the prediction accuracy of that branch varies across input sets. We evaluate 2D-profiling with the SPEC CPU 2000 integer benchmarks and show that it can identify input-dependent branches accurately.
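The key insight can be illustrated with a simplified sketch: split one branch's outcome trace from a single run into time slices, measure prediction accuracy in each slice, and flag the branch when accuracy varies a lot across slices. The last-outcome predictor, the slice size, the threshold, and the use of a plain standard deviation are all assumptions for illustration; the paper's actual tests are not reproduced here.

```python
from statistics import pstdev


def slice_accuracies(outcomes, slice_size):
    """Per-slice accuracy of a last-outcome predictor on one branch's trace."""
    accuracies = []
    predicted = True  # arbitrary initial prediction (assumption)
    for start in range(0, len(outcomes), slice_size):
        chunk = outcomes[start:start + slice_size]
        correct = 0
        for taken in chunk:
            correct += predicted == taken
            predicted = taken  # predict that the branch repeats its last outcome
        accuracies.append(correct / len(chunk))
    return accuracies


def looks_input_dependent(outcomes, slice_size=100, threshold=0.1):
    """Flag a branch whose accuracy varies strongly over a single run."""
    return pstdev(slice_accuracies(outcomes, slice_size)) > threshold


stable = [True] * 1000                        # always taken, accuracy is flat
phased = [True] * 500 + [True, False] * 250   # behavior changes mid-run
print(looks_input_dependent(stable), looks_input_dependent(phased))
```

The second trace is easy to predict in its first half and hard in its second, so its per-slice accuracy swings widely. That is the kind of within-run variation the abstract uses as a proxy for variation across input sets.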
- Conference Article
1
- 10.1145/2544137.2544148
- Feb 15, 2014
Software pipelining exploits instruction-level parallelism from loops. In static compilers, it has been one of the most efficient optimizations for wide-issue architectures. However, the compilation time is at least O(|V|^3) (V: the set of operations in a loop) and in the worst case exponential. This paper extends software pipelining to dynamic compilers. We present a novel and simple algorithm with linear time O(|V| + |E|) (E: the set of edges in the dependence graph of a loop). Preliminary experiments show the method is light-weight and generates optimal or near-optimal schedules.
- Research Article
9
- 10.1145/2508148.2485937
- Jun 23, 2013
- ACM SIGARCH Computer Architecture News
Work in quantum computer architecture has focused on communication, layout and fault tolerance, largely driven by Shor's factorization algorithm. For the first time, we study a larger range of benchmarks and find that another critical issue is the generation of code sequences for quantum rotation operations. Specifically, quantum algorithms require arbitrary rotation angles, while quantum technologies and error correction codes provide only for discrete angles and operators. A sequence of quantum machine instructions must be generated to approximate the arbitrary rotation to the required precision. While previous work has focused exclusively on static compilation, we find that some applications require dynamic code generation and explore the advantages and disadvantages of static and dynamic approaches. We find that static code generation can, in some cases, lead to a terabyte of machine code to support required rotations. We also find that some rotation angles are unknown until run time, requiring dynamic code generation. Dynamic code generation, however, exhibits significant trade-offs in terms of time overhead versus code size. Furthermore, dynamic code generation will be performed on classical (non-quantum) computing resources, which may or may not have a clock speed advantage over the target quantum technology. For example, operations on trapped ions run at kilohertz speeds, but superconducting qubits run at gigahertz speeds. We introduce a new method for compiling arbitrary rotations dynamically, designed to minimize compilation time. The new method reduces compilation time by up to five orders of magnitude while increasing code size by one order of magnitude. We explore the design space formed by these trade-offs of dynamic versus static code generation, code quality, and quantum technology. We introduce several techniques to provide smoother trade-offs for dynamic code generation and evaluate the viability of options in the design space.
- Conference Article
26
- 10.1145/2485922.2485937
- Jun 23, 2013
Work in quantum computer architecture has focused on communication, layout and fault tolerance, largely driven by Shor's factorization algorithm. For the first time, we study a larger range of benchmarks and find that another critical issue is the generation of code sequences for quantum rotation operations. Specifically, quantum algorithms require arbitrary rotation angles, while quantum technologies and error correction codes provide only for discrete angles and operators. A sequence of quantum machine instructions must be generated to approximate the arbitrary rotation to the required precision. While previous work has focused exclusively on static compilation, we find that some applications require dynamic code generation and explore the advantages and disadvantages of static and dynamic approaches. We find that static code generation can, in some cases, lead to a terabyte of machine code to support required rotations. We also find that some rotation angles are unknown until run time, requiring dynamic code generation. Dynamic code generation, however, exhibits significant trade-offs in terms of time overhead versus code size. Furthermore, dynamic code generation will be performed on classical (non-quantum) computing resources, which may or may not have a clock speed advantage over the target quantum technology. For example, operations on trapped ions run at kilohertz speeds, but superconducting qubits run at gigahertz speeds. We introduce a new method for compiling arbitrary rotations dynamically, designed to minimize compilation time. The new method reduces compilation time by up to five orders of magnitude while increasing code size by one order of magnitude. We explore the design space formed by these trade-offs of dynamic versus static code generation, code quality, and quantum technology. We introduce several techniques to provide smoother trade-offs for dynamic code generation and evaluate the viability of options in the design space.
- Conference Article
101
- 10.1109/micro.1996.566468
- Dec 23, 2002
The performance of modern microprocessors is greatly affected by cache behavior, instruction scheduling, register allocation and loop overhead. High level loop transformations such as fission, fusion, tiling, interchanging and outer loop unrolling (e.g., unroll and jam) are well known to be capable of improving all these aspects of performance. Difficulties arise because these machine characteristics and these optimizations are highly interdependent. Interchanging two loops might, for example, improve cache behavior but make it impossible to allocate registers in the inner loop. Similarly, unrolling or interchanging a loop might individually hurt performance but doing both simultaneously might help performance. Little work has been published on how to combine these transformations into an efficient and effective compiler algorithm. In this paper we present a model that estimates total machine cycle time taking into account cache misses, software pipelining, register pressure and loop overhead. We then develop an algorithm to intelligently search through the various possible transformations, using our machine model to select the set of transformations leading to the best overall performance. We have implemented this algorithm as part of the MIPSPro commercial compiler system. We give experimental results showing that our approach is both effective and efficient in optimizing numerical programs.
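Tiling, one of the loop transformations listed above, reorders the iteration space into small blocks without changing the result. The sketch below is a generic illustration of that transformation, not the paper's cost model or search algorithm; the function names and the `tile` parameter are assumptions, and Python is used only for readability since the cache benefit shows up in compiled code.

```python
def matmul_naive(a, b):
    """Textbook i-j-k matrix multiply on lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, visiting the iteration space tile by tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Work inside one tile touches a small block of a, b and c.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a_ik = a[i][k]  # held in a local, as a register would be
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_tiled(a, b) == matmul_naive(a, b))
```

The inner loops are also interchanged relative to the naive version (k before j). This hints at the interdependence the abstract describes: the best tile size and the best loop order cannot be chosen separately.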
- Conference Article
211
- 10.5555/243846.243895
- Dec 2, 1996
The performance of modern microprocessors is greatly affected by cache behavior, instruction scheduling, register allocation and loop overhead. High level loop transformations such as fission, fusion, tiling, interchanging and outer loop unrolling (e.g., unroll and jam) are well known to be capable of improving all these aspects of performance. Difficulties arise because these machine characteristics and these optimizations are highly interdependent. Interchanging two loops might, for example, improve cache behavior but make it impossible to allocate registers in the inner loop. Similarly, unrolling or interchanging a loop might individually hurt performance but doing both simultaneously might help performance. Little work has been published on how to combine these transformations into an efficient and effective compiler algorithm. In this paper we present a model that estimates total machine cycle time taking into account cache misses, software pipelining, register pressure and loop overhead. We then develop an algorithm to intelligently search through the various possible transformations, using our machine model to select the set of transformations leading to the best overall performance. We have implemented this algorithm as part of the MIPSPro commercial compiler system. We give experimental results showing that our approach is both effective and efficient in optimizing numerical programs.