Shoot Yourself in the Foot — Efficient Code Causes Inefficiency in Compiler Optimizations
In this paper, we take a different angle on evaluating compiler optimizations from existing work in the compiler testing literature. In particular, we consider a specific scenario in software development: when developers manually optimize a program to improve its performance, do compilers actually generate more efficient code with the help of the developers' optimizations?
- Research Article
18
- 10.1051/e3sconf/202339904047
- Jan 1, 2023
- E3S Web of Conferences
The modern period has seen advances in compiler design, optimization techniques, and software system efficiency. This study presents a thorough analysis of how the most recent developments in compiler design and optimization techniques affect program execution speed, memory utilization, and overall software quality, with particular attention to compilers that generate efficient, high-performance code without manual intervention.
- Conference Article
- 10.1145/2742854.2742888
- May 6, 2015
Today's big data challenge presses for a breakthrough in programming models. A simple programming model capable of both high productivity and high performance is desired. This paper proposes a simple solution to realize a set of restricted yet fundamental productivity features in C-family languages, without sacrificing their efficiency. This is achieved by leveraging a productivity language runtime and compiler analyses. Programmers write a program in the familiar C/C++/Objective-C style, without even knowing it is a mix of productivity and efficiency code. The program evolves as both a rapid prototype and efficiency code.
- Conference Article
- 10.1145/255471.255625
- Jan 1, 1990
This presentation examines the effect of a few simple Ada coding styles on compiler optimizations. The discussion should make the user aware of the impact these styles have on the ability of an Ada compiler to optimize the resulting programs. During the presentation, simple examples of each style will be presented, along with the impact of each on a compiler's ability to optimize a program.

First, the use of separates adversely affects optimizations across calls to the separate unit. Because no analysis of the separate unit can be performed before a use of it is seen, worst-case assumptions must be made about its effects. This inhibits optimizations such as constant folding, common subexpression elimination, and equivalence propagation from occurring across a call to a separate unit. Thus, although separates may be attractive from a development point of view, their use may have a negative impact on system performance.

Another feature that limits what an optimizer can do is access types. Because an access type may be assigned a value using unchecked conversion, a compiler cannot accurately determine which objects have actually been modified when an assignment is made to a dereferenced access type value. For this reason, a compiler must limit optimizations across assignments to a dereferenced access variable. Aware of how access types limit optimization, a user may be better off using an alternative such as array structures, where compiler optimizations are not limited.

Package machine_code is another area that limits a compiler's ability to optimize across procedure calls. Often, to insert a few assembly-language instructions into a program very efficiently, a programmer will use package machine_code. The resulting code may not be as efficient as desired: the compiler has no semantic information on the effects of the assembly code and therefore cannot optimize across the inserted assembly code. An alternative approach, when only a limited set of single assembly instructions is needed, is to provide a package of built-in functions whose effects are known to the compiler. Each built-in function is mapped to a single assembly instruction by the compiler, so the compiler is not forced to abandon optimization across calls to the built-in functions.

A fourth area that limits a compiler's ability to produce efficient code involves the use of unconstrained types. Typically, to make a routine more general, a user may define a procedure that operates on unconstrained array types, even though the procedure may only be used with one type. At compile time, however, the compiler does not know the actual sizes of the array objects with which the routine will be called, so it must emit code that determines the array sizes at runtime. This can be expensive in terms of execution time. A better alternative, in terms of runtime performance, is to write a generic unit and instantiate it for constrained types. This allows the compiler to determine all the necessary size information at compile time and to produce very efficient code for each instantiation, though the efficiency comes at the cost of enlarged code space.

Finally, the initialization of record and array objects also affects runtime performance. Field-by-field initialization yields poor runtime performance and code size. Significantly smaller code size and faster runtime performance can be obtained if the initialization is done with a static aggregate: the compiler can create the aggregate at compile time, and at runtime the object can be initialized by a very efficient block copy of the compiler-generated aggregate or array object. However, if the aggregate is not static, i.e., the compiler cannot completely evaluate it at compile time, then aggregate initialization can be expensive.
- Conference Article
14
- 10.1109/iced.2008.4786702
- Dec 1, 2008
Power consumption is a subject of serious consideration in embedded systems design because embedded systems are constrained by stringent power and energy requirements. Lowering the power consumption and energy usage of such systems is an important task to prolong their usage in real-time situations. In this paper, we study the effects of compiler optimizations on embedded systems' energy usage and power consumption in real-time situations and the importance of running efficient binary codes in realizing a more power-efficient and better-performing embedded system. Compiler optimizations at various levels, involving different architectural features, have been experimented with, and it is shown that architecture-driven compiler optimizations have a better impact on reducing power consumption and energy usage in embedded systems than blind code optimizations.
- Research Article
4
- 10.1504/ijict.2009.026431
- Jan 1, 2009
- International Journal of Information and Communication Technology
Power consumption is a subject of serious consideration in embedded systems design because embedded systems are constrained by stringent power and energy requirements. Lowering the power consumption and energy usage of such systems is an important task to prolong their usage in real-time situations. In this paper, we study the effects of compiler optimisations on embedded systems' energy usage and power consumption in real-time situations and the importance of running efficient binary codes in realising a more power-efficient and better-performing embedded system. Compiler optimisations at various levels, involving different architectural features, have been experimented with, and it is shown that architecture-driven compiler optimisations have a better impact on reducing power consumption and energy usage in embedded systems than blind code optimisations.
- Conference Article
18
- 10.1145/3460945.3464952
- Jun 20, 2021
Because loops execute their body many times, compiler developers place much emphasis on their optimization. Nevertheless, in view of highly diverse source code and hardware, compilers still struggle to produce optimal target code. The sheer number of possible loop optimizations, including their combinations, exacerbates the problem further. Today's compilers use hard-coded heuristics to decide when, whether, and which of a limited set of optimizations to apply. Often, this leads to highly unstable behavior, making the success of compiler optimizations dependent on the precise way a loop has been written. This paper presents LoopLearner, which addresses the problem of compiler instability by predicting which way of writing a loop will lead to efficient compiled code. To this end, we train a neural network to find semantically invariant source-level transformations for loops that help the compiler generate more efficient code. Our model learns to extract useful features from the raw source code and predicts the speedup that a given transformation is likely to yield. We evaluate LoopLearner with 1,895 loops from various performance-relevant benchmarks. Applying the transformations that our model deems most favorable prior to compilation yields an average speedup of 1.14x. When trying the top-3 suggested transformations, the average speedup even increases to 1.29x. Comparing the approach with an exhaustive search through all available code transformations shows that LoopLearner helps to identify the most beneficial transformations in several orders of magnitude less time.
- Book Chapter
5
- 10.1007/978-3-540-75444-2_43
- Jan 1, 2007
One of the outcomes of DARPA’s HPCS program has been the creation of three new high productivity languages: Chapel, Fortress, and X10. While these languages have introduced improvements in language expressiveness and programmer productivity, several technical challenges still remain in delivering high performance with these languages. In the absence of optimization, the high-level language constructs that improve productivity can result in order-of-magnitude runtime performance degradations. This paper addresses the problem of efficient code generation for high level array accesses in the X10 language. Two aspects of high level array accesses in X10 are important for productivity but also pose significant performance challenges: the high level accesses are performed through Point objects rather than integer indices, and variables containing references to arrays are rank-independent. Our solution to the first challenge is to extend the X10 compiler with automatic inlining and scalar replacement of Point objects. Our partial solution to the second challenge is to use X10’s dependent type system to enable the programmer to annotate array variable declarations with additional information for the rank and region of the variable, and to allow the compiler to generate efficient code in cases where the dependent type information is available. Although this paper focuses on high level array accesses in X10, our approach is applicable to similar constructs in other languages. Our experimental results for single-thread performance demonstrate that these compiler optimizations can enable high-level X10 array accesses with implicit ranks and Points to improve performance by up to a factor of 5.4× over unoptimized X10 code, and to also achieve performance comparable (from 48% to 100%) to that of lower-level Java programs.
These results underscore the importance of the optimization techniques presented in this paper for achieving high performance with high productivity.
- Conference Article
2
- 10.5555/645989.674322
- Sep 22, 2002
This paper describes a just-in-time (JIT) Java compiler for the Intel® Itanium® processor. The Itanium processor is an example of an Explicitly Parallel Instruction Computing (EPIC) architecture and thus relies on aggressive and expensive compiler optimizations for performance. Static compilers for Itanium use aggressive global scheduling algorithms to extract instruction-level parallelism. In a JIT compiler, however, the additional overhead of such expensive optimizations may offset any gains from the improved code. In this paper, we describe lightweight code generation techniques for generating efficient Itanium code. Our compiler relies on two basic methods to generate efficient code. First, the compiler uses inexpensive scheduling heuristics to model the Itanium microarchitecture. Second, the compiler uses the semantics of the Java virtual machine to extract instruction-level parallelism.
- Single Book
217
- 10.1007/978-1-4613-1705-0
- Jan 1, 1988
The Warp machine is a linear array of ten programmable processors and is capable of executing 100 million floating-point operations per second (100 MFLOPS). The individual processors, or cells, derive their performance from a wide instruction set and a high degree of internal pipelining and parallelism. Can an array of high-performance cells be programmed to cooperate at a fine-grain of parallelism? My thesis is that systolic arrays of high-performance cells can be programmed effectively using a high-level language. The solution has two components: a machine abstraction and compiler optimizations for systolic arrays, and code scheduling techniques for horizontally microcoded or VLIW processors. In the proposed machine abstraction, individual cells are programmed in a high-level programming language; inter-cell communication is explicitly specified by asynchronous primitives: receive and send operations. This machine abstraction offers both efficiency and generality. Unidirectional systolic array programs can be compiled into highly efficient code by compiler optimizations that exploit the high-level semantics of asynchronous communication. This abstraction is applicable even for simple implementations with no dynamic flow control hardware by using an efficient compile-time control flow algorithm. This thesis shows that software pipelining is a practical and efficient code scheduling technique for highly parallel and pipelined processors. We have extended the previous scheduling algorithm and introduced a new optimization called modulo variable expansion. We show that near-optimal results can be obtained using software heuristics. This thesis also proposes a unified approach to scheduling both within and across basic blocks called hierarchical reduction. This technique makes software pipelining applicable to all innermost loops, including those containing conditional statements. A consistent performance improvement can thus be obtained for all programs. 
The ideas and techniques in the thesis have been validated by the implementation of an optimizing compiler for Warp. Optimal performance is obtained for many classical systolic programs. The compiler has made possible the development and implementation of many new, complex systolic algorithms. This thesis research has contributed to extending the domain of systolic processing from implementing simple mathematical recurrences using custom VLSI circuitry to executing arbitrarily complex programs on powerful and programmable processors.
- Research Article
6
- 10.1016/j.parco.2023.103016
- Feb 23, 2023
- Parallel Computing
NPDP benchmark suite for the evaluation of the effectiveness of automatic optimizing compilers
- Research Article
9
- 10.1007/bf01807504
- Sep 1, 1992
- Lisp and Symbolic Computation
Common Lisp [25], [26] includes a dynamic datatype system of moderate complexity, as well as predicates for checking the types of language objects. Additionally, an interesting predicate of two “type specifiers”—SUBTYPEP—is included in the language. This subtypep predicate provides a mechanism with which to query the Common Lisp type system regarding containment relations among the various built-in and user-defined types. While subtypep is rarely needed by an applications programmer, the efficiency of a Common Lisp implementation can depend critically upon the quality of its subtypep predicate: the run-time system typically calls upon subtypep to decide what sort of representations to use when making arrays; the compiler calls upon subtypep to interpret user declarations, on which efficient data representation and code generation decisions are based. As might be expected due to the complexity of the Common Lisp type system, there may be type containment questions which cannot be decided. In these cases subtypep is expected to return “can't determine”, in order to avoid giving an incorrect answer. Unfortunately, most Common Lisp implementations have abused this license by answering “can't determine” in all but the most trivial cases.

In particular, most Common Lisp implementations of SUBTYPEP fail on the basic axioms of the Common Lisp type system itself [25], [26]. This situation is particularly embarrassing for Lisp, the premier “symbol processing language”, in which the implementation of complex symbolic logical operations should be relatively easy. Since subtypep was presumably included in Common Lisp to answer the hard cases of type containment, this “lazy evaluation” limits the usefulness of an important language feature. This paper shows how those type containment relations of Common Lisp which can be decided at all can be decided simply and quickly by a decision procedure which can dramatically reduce the number of occurrences of the “can't determine” answer from subtypep. This decision procedure does not require the conversion of a type specifier expression to conjunctive or disjunctive normal form, and therefore does not incur the exponential explosion in space and time that such a conversion would entail. The lattice mechanism described here for deciding subtypep is also ideal for performing type inference [2]; the particular implementation developed here, however, is specific to the type system of Common Lisp [4]. Categories and Subject Descriptors: Lisp, dynamic typing, compiler optimization, type inference, decision procedure.
- Conference Article
2
- 10.1109/icse-nier.2019.00035
- May 1, 2019
Highly-configurable systems written in C form our most critical computing infrastructure. The preprocessor is integral to C, because conditional compilation enables such systems to produce efficient object code. However, the preprocessor makes code harder to reason about for both humans and tools. Previous approaches to this challenge developed new program analyses for unpreprocessed source code or developed new languages and constructs to replace the preprocessor. But having special-purpose analyses means maintaining a new toolchain, while new languages face adoption challenges and do not help with existing software. We propose the best of both worlds: eliminate the preprocessor but preserve its benefits. Our design replaces preprocessor usage with C itself, augmented with syntax-preserving, backwards-compatible dependent types. We discuss automated conditional compilation to replicate preprocessor performance. Our approach opens new directions for research into new compiler optimizations, dependent types for configurable software, and automated translation away from preprocessor use.
- Research Article
55
- 10.1016/j.jpdc.2019.09.012
- Sep 28, 2019
- Journal of Parallel and Distributed Computing
SIMD programming using Intel vector extensions
- Research Article
1
- 10.1109/99.537097
- Jan 1, 1996
- IEEE Computational Science and Engineering
It is sound programming practice to define the data structures to be computed on before the actual programming effort starts. Actually, this step is crucial to obtaining efficient and portable code. Parallel codes, and more specifically computational science and engineering codes, are no exception to this rule. On the other hand, it is also well known that specific data structure selections can prevent compiler analysis and thereby prohibit program optimization. This problem is best illustrated by the representation of a sparse code in either Fortran with indirect addressing, or in another language with pointer structures. In this situation software maintenance and the effort of producing sparse computation codes become complicated, and most compiler optimizations get disabled. The paper considers how these two opposing interests can be expected to increase in importance for computational science and engineering. Especially in CSE, the need for high performance will push programmers to use more advanced data structures, and optimizing compiler technology will also be stressed more and more.
- Book Chapter
22
- 10.1007/11532378_19
- Jan 1, 2005
Optimizing compilers have a long history of applying loop transformations to C and Fortran scientific applications. However, such optimizations are rare in compilers for object-oriented languages such as C++ or Java, where loops operating on user-defined types are left unoptimized due to their unknown semantics. Our goal is to reduce the performance penalty of using high-level object-oriented abstractions. We propose an approach that allows explicit communication between programmers and compilers. We have extended the traditional Fortran loop optimizations with an open interface. Through this interface, we have developed techniques to automatically recognize and optimize user-defined array abstractions. In addition, we have developed an adapted constant-propagation algorithm to automatically propagate properties of abstractions. We have implemented these techniques in a C++ source-to-source translator and have applied them to optimize several kernels written using an array-class library. Our experimental results show that using our approach, applications using high-level abstractions can achieve comparable, and in some cases superior, performance to that achieved by efficient low-level hand-written codes.