On Thin Air Reads: Towards an Event Structures Model of Relaxed Memory
This is the first paper to propose a pure event structures model of relaxed memory. We propose confusion-free event structures over an alphabet with a justification relation as a model. Executions are modeled by justified configurations, where every read event has a justifying write event. Justification alone is too weak a criterion, since it allows cycles of the kind that result in so-called thin-air reads. Acyclic justification forbids such cycles, but also invalidates event reorderings that result from compiler optimizations and dynamic instruction scheduling. We propose the notion of well-justification, based on a game-like model, which strikes a middle ground. We show that well-justified configurations satisfy the DRF theorem: in any data-race free program, all well-justified configurations are sequentially consistent. We also show that rely-guarantee reasoning is sound for well-justified configurations, but not for justified configurations. For example, well-justified configurations are type-safe. Well-justification allows many, but not all reorderings performed by relaxed memory. In particular, it fails to validate the commutation of independent reads. We discuss variations that may address these shortcomings.
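The gap between justification and acyclic justification can be illustrated on the classic thin-air example (thread 1: `r = x; y = r`, thread 2: `r = y; x = r`). The following is a minimal Python sketch, not the paper's formal definition: the `Event` class, the helper names, and the reading of "acyclic" as "the chosen justification edges together with program order contain no cycle" are assumptions made for illustration. Well-justification itself is not modeled here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Event:
    name: str   # hypothetical label, only used to tell events apart
    kind: str   # "R" for a read, "W" for a write
    loc: str    # memory location
    val: int    # value read or written


def justifies(w, r):
    """A write justifies a read of the same location and value."""
    return w.kind == "W" and r.kind == "R" and (w.loc, w.val) == (r.loc, r.val)


def is_justified(config):
    """Every read in the configuration has some justifying write in it."""
    return all(any(justifies(w, r) for w in config)
               for r in config if r.kind == "R")


def is_acyclic(nodes, edges):
    """Kahn-style check: all nodes can be removed in topological order."""
    indeg = {n: 0 for n in nodes}
    for _, b in edges:
        indeg[b] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for a, b in edges:
            if a == n:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return removed == len(nodes)


def is_acyclically_justified(config, order):
    """Assumed reading: some choice of one justifying write per read
    is acyclic when combined with the given program order."""
    reads = [e for e in config if e.kind == "R"]
    choices = [[w for w in config if justifies(w, r)] for r in reads]
    return any(is_acyclic(config, set(order) | set(zip(pick, reads)))
               for pick in product(*choices))


# Initial writes of 0 to both locations.
ix, iy = Event("ix", "W", "x", 0), Event("iy", "W", "y", 0)

# Thin-air candidate: both threads read 42 and copy it across.
r1, w1 = Event("r1", "R", "x", 42), Event("w1", "W", "y", 42)
r2, w2 = Event("r2", "R", "y", 42), Event("w2", "W", "x", 42)
thin_air = [ix, iy, r1, w1, r2, w2]
thin_air_order = {(r1, w1), (r2, w2)}

# Sequentially consistent candidate: both threads read the initial 0.
s1, t1 = Event("s1", "R", "x", 0), Event("t1", "W", "y", 0)
s2, t2 = Event("s2", "R", "y", 0), Event("t2", "W", "x", 0)
sc = [ix, iy, s1, t1, s2, t2]
sc_order = {(s1, t1), (s2, t2)}
```

Under these assumptions `is_justified(thin_air)` holds, because each read of 42 finds a matching write, while `is_acyclically_justified(thin_air, thin_air_order)` fails because the only available justifications form a cycle through both threads. The configuration `sc` passes both checks.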
- Research Article
14
- 10.23638/lmcs-15(1:33)2019
- Mar 29, 2019
- Logical Methods in Computer Science
To model relaxed memory, we propose confusion-free event structures over an alphabet with a justification relation. Executions are modeled by justified configurations, where every read event has a justifying write event. Justification alone is too weak a criterion, since it allows cycles of the kind that result in so-called thin-air reads. Acyclic justification forbids such cycles, but also invalidates event reorderings that result from compiler optimizations and dynamic instruction scheduling. We propose the notion of well-justification, based on a game-like model, which strikes a middle ground. We show that well-justified configurations satisfy the DRF theorem: in any data-race free program, all well-justified configurations are sequentially consistent. We also show that rely-guarantee reasoning is sound for well-justified configurations, but not for justified configurations. For example, well-justified configurations are type-safe. Well-justification allows many, but not all reorderings performed by relaxed memory. In particular, it fails to validate the commutation of independent reads. We discuss variations that may address these shortcomings.
- Research Article
1
- 10.1023/a:1008125919892
- Aug 1, 1999
- Journal of VLSI signal processing systems for signal, image and video technology
We consider the increased performance that can be obtained by using three previously proposed ideas in concert, two of which have been used in commercial systems. These ideas are aggressive dynamic (run time) instruction scheduling, reuse of decoded instructions, and trace scheduling. We show that these ideas complement and support one another. Hence, while each of these ideas has been shown to have merit in its own right, we claim that the overall advantage when they are used in concert is greater than that obtained by using any one singly. To support this claim, we present the results from running several common multimedia kernels. Overall, these results show an average speedup of 3.50 times what can be had by using dynamic instruction scheduling alone.
- Research Article
88
- 10.1109/2.30730
- Jul 1, 1989
- Computer
An overview of the problem of instruction scheduling for pipelined computers and a survey of its solutions are provided. The author demonstrates that dynamic instruction scheduling can provide performance improvements not possible with static scheduling alone. He describes a high-performance computer, the Astronautics ZS-1, which uses novel methods for implementing dynamic scheduling and which can outperform computers using similar-speed technologies that rely solely on state-of-the-art static scheduling techniques.
- Research Article
- 10.5075/epfl-thesis-4541
- Jan 1, 2010
- Infoscience (Ecole Polytechnique Fédérale de Lausanne)
Transactional memory (TM) has shown potential to simplify the task of writing concurrent programs. TM shifts the burden of managing concurrency from the programmer to the TM algorithm. The correctness of TM algorithms is generally proved manually. The goal of this thesis is to provide the mathematical and software tools to automatically verify TM algorithms under realistic memory models. Our first contribution is to develop a mathematical framework to capture the behavior of TM algorithms and the required correctness properties. We consider the safety property of opacity and the liveness properties of obstruction freedom and livelock freedom. We build a specification language of opacity. We build a framework to express hardware relaxed memory models. We develop a new high-level language, Relaxed Memory Language (RML), for expressing concurrent algorithms with a hardware-level atomicity of instructions, whose semantics is parametrized by various relaxed memory models. We express TM algorithms like TL2, DSTM, and McRT STM in our framework. The verification of TM algorithms is difficult because of the unbounded number, length, and delay of concurrent transactions and the unbounded size of the memory. The second contribution of the thesis is to identify structural properties of TM algorithms which allow us to reduce the unbounded verification problem to a language-inclusion check between two finite state systems. We show that common TM algorithms satisfy these structural properties. The third contribution of the thesis is our tool FOIL for model checking TM algorithms. FOIL takes as input the RML description of a TM algorithm and the description of a memory model. FOIL uses the operational semantics of RML to compute the language of the TM algorithm for two threads and two variables. FOIL then checks whether the language of the TM algorithm is included in the specification language of opacity. 
FOIL automatically determines the locations of fences, which if inserted, ensure the correctness of the TM algorithm under the given memory model. We use FOIL to verify DSTM, TL2, and McRT STM under the memory models of sequential consistency, total store order, partial store order, and relaxed memory order.
- Conference Article
- 10.1109/indicon.2017.8487498
- Dec 1, 2017
A typical superscalar processor fetches, decodes and executes several instructions. The incoming instruction stream is then analyzed for data dependencies and resource dependencies. The dispatcher distributes instructions to functional units based on the availability of functional units and data. This is referred to as dynamic instruction scheduling. This paper proposes a dynamic scheduling scheme for a superscalar processor that consists of four functional units, an instruction analyzer window of 8 instructions, an instruction decoder, and a dispatcher with a register bank. Four independent out-of-order instructions are executed in parallel. To improve the speed of the processor, the Tomasulo algorithm is implemented using the Isim simulator in Xilinx 14.5. To demonstrate the potential of the architecture, an FIR filter is implemented and its execution time is compared with and without dynamic scheduling, and also against a scalar processor architecture.
- Conference Article
1
- 10.5753/sbac-pad.1999.19788
- Sep 29, 1999
There are two distinct groups of research into ILP: those that strongly favour static instruction scheduling and those that favour dynamic instruction scheduling. This paper introduces powerful static and dynamic scheduling models and combines them within the framework of a single simulation environment. Both individual models achieve respectable speedups; dynamic scheduling significantly out-performs static scheduling when an idealised processor model with perfect branch prediction is used. However, when a realistic branch predictor is substituted, the roles are reversed, and static scheduling achieves the higher performance. Similarly, static scheduling performs better in the absence of branch prediction or when processor resources are restricted. Finally, we combine static scheduling with out-of-order instruction issue. Disappointingly, when an ideal out-of-order processor is used, scheduled code fails to match the performance of unscheduled code. Furthermore, with realistic branch prediction, out-of-order issue fails to improve the performance of scheduled code.
- Research Article
5
- 10.1145/325096.325140
- May 1, 1990
- ACM SIGARCH Computer Architecture News
An investigation of static versus dynamic scheduling. Authors: Carl E. Love (University of Colorado at Boulder, 2505 Table Mesa Dr., Boulder, Colorado) and Harry F. Jordan (University of Colorado at Boulder, Dept. of Electrical Engineering, Campus Box 425, Boulder, Colorado). ACM SIGARCH Computer Architecture News, Volume 18, Issue 2SI, June 1990, pp. 192–201. https://doi.org/10.1145/325096.325140. Published: 01 May 1990.
- Conference Article
2
- 10.1109/iccse.2016.7581696
- Aug 1, 2016
The extensive use of dynamic instruction scheduling has made it essential content in Computer Architecture (CA) courses. Practical teaching of this content, however, is always a weak link in the teaching of CA. With current teaching methods, teachers just explain the principle of dynamic instruction scheduling through traditional or multimedia instruction. This kind of method is far from effective in helping students understand dynamic scheduling. Therefore, it is necessary to adopt experiment-based methods. In this paper, we propose a lightweight simulator framework for the teaching of CA, especially for pipelining and dynamic instruction scheduling. We first design a basic simulator called PipelineSim, which supports a basic five-stage MIPS pipeline. Scoreboarding and Tomasulo are then integrated into PipelineSim. Students can implement either the Scoreboarding or the Tomasulo algorithm based on this framework instead of only learning these two mechanisms from lectures. We also provide an example of a designed experiment, which teachers can use as a reference or teaching resource.
- Book Chapter
4
- 10.1007/3-540-36498-6_9
- Jan 1, 2003
In a multithreaded program running on a multiprocessor platform, different processors may observe operations in different orders. This may lead to surprising results not anticipated by the programmer. The problem is exacerbated by common compiler and hardware optimization techniques. A memory (consistency) model provides a contract between the system designer and the software designer that constrains the order in which operations are observed. Every memory model strikes some balance between strictness (simplifying program behavior) and laxness (permitting greater optimization). With its emphasis on cross-platform compatibility, the Java programming language needs a memory model that is satisfactory to language users and implementors. Everyone in the Java community must be able to understand the Java memory model and its ramifications. The description of the original Java memory model suffered from ambiguity and opaqueness, and attempts to interpret it revealed serious deficiencies. Two memory models have been proposed as replacements. Unfortunately, these two new models are described at different levels of abstraction and are represented in different formats, making it difficult to compare them. In this paper we formalize these models and develop a unified representation of them, using Abstract State Machines. Using our formal specifications, we relate the new Java memory models to the Location Consistency memory model and to each other. Keywords: Shared Memory; Main Memory; Memory Model; Location Consistency; Abstract State Machine.
- Conference Article
3
- 10.1109/async.2001.914078
- Mar 11, 2001
This paper proposes an asynchronous superscalar architecture called DCAP to exploit instruction-level parallelism based on a novel dynamic instruction scheduling technique. The proposed technique not only has an efficient implementation using asynchronous micropipelines, it also minimizes the amount of hardware required for instruction scheduling when compared to standard schemes used in synchronous superscalar processors. In addition, the proposed technique for dynamic instruction scheduling also exploits the dependency patterns in the instruction streams for enhanced performance. DCAP is a fully functional model of an asynchronous superscalar processor and supports register renaming and precise interrupts. A detailed performance analysis of DCAP on realistic benchmarks is presented.
- Research Article
9
- 10.1145/1809028.1806636
- Jun 5, 2010
- ACM SIGPLAN Notices
The most intuitive memory model for shared-memory multithreaded programming is sequential consistency (SC), but it disallows the use of many compiler and hardware optimizations thereby impacting performance. Data-race-free (DRF) models, such as the proposed C++0x memory model, guarantee SC execution for data-race-free programs. But these models provide no guarantee at all for racy programs, compromising the safety and debuggability of such programs. To address the safety issue, the Java memory model, which is also based on the DRF model, provides a weak semantics for racy executions. However, this semantics is subtle and complex, making it difficult for programmers to reason about their programs and for compiler writers to ensure the correctness of compiler optimizations. We present the DRFx memory model, which is simple for programmers to understand and use while still supporting many common optimizations. We introduce a memory model (MM) exception which can be signaled to halt execution. If a program executes without throwing this exception, then DRFx guarantees that the execution is SC. If a program throws an MM exception during an execution, then DRFx guarantees that the program has a data race. We observe that SC violations can be detected in hardware through a lightweight form of conflict detection. Furthermore, our model safely allows aggressive compiler and hardware optimizations within compiler-designated program regions. We formalize our memory model, prove several properties about this model, describe a compiler and hardware design suitable for DRFx, and evaluate the performance overhead due to our compiler and hardware requirements.
- Book Chapter
1
- 10.1007/978-3-319-05119-2_15
- Jan 1, 2014
We study two operational semantics for relaxed memory models. Our first formalization is based on the notion of write-buffers, which is pervasive in the memory models literature. We instantiate the Total Store Ordering (TSO) and Partial Store Ordering (PSO) memory models in this framework. Memory models that support more aggressive relaxations (e.g. read-to-read reordering) are not easily described with write-buffers. Our second framework is based on a general notion of speculative computation. In particular, we allow the prediction of function arguments and execution ahead of time (e.g. by branch prediction). While technically more involved than write-buffers, this model is more expressive and can encode all the Sparc family of memory models: TSO, PSO and Relaxed Memory Ordering (RMO). We validate the adequacy of our instantiations of TSO and PSO by formally comparing their write-buffer and speculative formalizations. The use of operational semantics techniques is paramount for the tractability of these proofs.
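The write-buffer view of TSO can be made concrete with the store-buffering litmus test (thread 0: `x = 1; r0 = y`, thread 1: `y = 1; r1 = x`). The following is a minimal Python sketch under stated assumptions, not the formalization from the chapter: the instruction encoding, the function `outcomes`, and the single FIFO buffer per thread are illustrative choices. It enumerates every interleaving and collects the final values read by each thread.

```python
def put(tup, i, value):
    """Return a copy of tuple `tup` with position `i` set to `value`."""
    return tup[:i] + (value,) + tup[i + 1:]


def store(mem, loc, val):
    """Return memory (a tuple of (location, value) pairs) after a write."""
    return tuple((l, val if l == loc else v) for l, v in mem)


def outcomes(program, use_buffers):
    """Explore all executions and return the set of final register tuples.

    `program` is a list of threads; each instruction is ("W", loc, val)
    or ("R", loc). Each thread has one register holding its last read.
    With `use_buffers`, writes go to a per-thread FIFO buffer first.
    """
    locs = sorted({ins[1] for thread in program for ins in thread})
    start = (tuple(0 for _ in program),      # program counters
             tuple(() for _ in program),     # per-thread write buffers
             tuple((l, 0) for l in locs),    # shared memory, initially 0
             tuple(None for _ in program))   # result registers
    seen, stack, results = {start}, [start], set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        succs = []
        for t, thread in enumerate(program):
            if bufs[t]:
                # Flush the oldest buffered write of thread t to memory.
                loc, val = bufs[t][0]
                succs.append((pcs, put(bufs, t, bufs[t][1:]),
                              store(mem, loc, val), regs))
            if pcs[t] < len(thread):
                ins = thread[pcs[t]]
                new_pcs = put(pcs, t, pcs[t] + 1)
                if ins[0] == "W":
                    _, loc, val = ins
                    if use_buffers:
                        new_buf = bufs[t] + ((loc, val),)
                        succs.append((new_pcs, put(bufs, t, new_buf), mem, regs))
                    else:
                        succs.append((new_pcs, bufs, store(mem, loc, val), regs))
                else:
                    _, loc = ins
                    # A read sees the thread's own newest buffered write first.
                    own = [v for l, v in bufs[t] if l == loc]
                    val = own[-1] if own else dict(mem)[loc]
                    succs.append((new_pcs, bufs, mem, put(regs, t, val)))
        if not succs:
            # All threads finished and all buffers drained.
            results.add(regs)
        for s in succs:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return results


store_buffering = [[("W", "x", 1), ("R", "y")],
                   [("W", "y", 1), ("R", "x")]]
sc_results = outcomes(store_buffering, use_buffers=False)
tso_results = outcomes(store_buffering, use_buffers=True)
```

In this sketch the outcome `(0, 0)` appears only in `tso_results`: both writes can sit in their buffers while each thread reads the other location from memory. Modeling PSO would need one buffer per location rather than per thread, which this sketch does not attempt.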
- Book Chapter
- 10.1007/978-3-642-39304-4_2
- Jan 1, 2013
The Central Processing Unit (CPU) in a microprocessor is responsible for running machine instructions as fast as possible so that the machine performance is at its maximum level. While simple in design, in-order execution processors provide sub-optimal performance, because any delay in instruction processing blocks the entire instruction stream. To overcome this limitation, modern high-performance designs use out-of-order (OoO) instruction scheduling to better exploit available Instruction-Level Parallelism (ILP), and both static (compiler-assisted) and dynamic (hardware-assisted) scheduling solutions are possible. Hardware-assisted scheduling integrates an OoO core that requires a complex dynamic instruction scheduler, and additional datapath structures are used to hold the in-flight instructions in program order to support the reconstruction of precise program state. The logic becomes even more complex when superscalar designs (those capable of executing multiple instructions every clock cycle) are used. This chapter gives a brief introduction to instruction scheduling on pipelined superscalar architectures and then explains some of the keystone static and dynamic instruction scheduling algorithms.
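The claim that a single delayed instruction blocks an in-order stream can be seen in a toy timing model. The following is a minimal Python sketch under strong simplifying assumptions, not an algorithm from the chapter: one instruction issues per cycle, each instruction is a hypothetical `(latency, deps)` pair where `deps` lists the indices of earlier instructions it needs, and the out-of-order variant has an unlimited window and no resource limits.

```python
def in_order_finish(instrs):
    """Cycle at which the last result is ready when issue follows program order."""
    done = []          # done[i] = cycle at which instruction i's result is ready
    next_issue = 0     # earliest cycle the next instruction may issue
    for latency, deps in instrs:
        # Stall until all source results are ready; later instructions wait too.
        issue = max([next_issue] + [done[d] for d in deps])
        done.append(issue + latency)
        next_issue = issue + 1
    return max(done)


def out_of_order_finish(instrs):
    """Same machine, but each cycle issues the oldest instruction that is ready.

    Dependencies must point to earlier indices, otherwise this never terminates.
    """
    done = {}
    pending = list(range(len(instrs)))
    cycle = 0
    while pending:
        for i in pending:  # scan oldest first
            latency, deps = instrs[i]
            if all(d in done and done[d] <= cycle for d in deps):
                done[i] = cycle + latency
                pending.remove(i)
                break      # at most one issue per cycle
        cycle += 1
    return max(done.values())


# A 4-cycle load, an add that needs the load, then two independent operations.
example = [(4, []), (1, [0]), (1, []), (1, [])]
```

With these assumptions `in_order_finish(example)` is 7, because the two independent operations wait behind the stalled add, while `out_of_order_finish(example)` is 5, because they issue during the load's latency.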
- Research Article
83
- 10.1016/0743-7315(92)90052-o
- Aug 1, 1992
- Journal of Parallel and Distributed Computing
Programming for different memory consistency models
- Research Article
2
- 10.1145/2925988
- Sep 15, 2016
- ACM Transactions on Programming Languages and Systems
The most intuitive memory model for shared-memory multi-threaded programming is sequential consistency (SC), but it disallows the use of many compiler and hardware optimizations and thus affects performance. Data-race-free (DRF) models, such as the C++11 memory model, guarantee SC execution for data-race-free programs. But these models provide no guarantee at all for racy programs, compromising the safety and debuggability of such programs. To address the safety issue, the Java memory model, which is also based on the DRF model, provides a weak semantics for racy executions. However, this semantics is subtle and complex, making it difficult for programmers to reason about their programs and for compiler writers to ensure the correctness of compiler optimizations. We present the DRFx memory model, which is simple for programmers to understand and use while still supporting many common optimizations. We introduce a memory model (MM) exception that can be signaled to halt execution. If a program executes without throwing this exception, then DRFx guarantees that the execution is SC. If a program throws an MM exception during an execution, then DRFx guarantees that the program has a data race. We observe that SC violations can be detected in hardware through a lightweight form of conflict detection. Furthermore, our model safely allows aggressive compiler and hardware optimizations within compiler-designated program regions. We formalize our memory model, prove several properties of this model, describe a compiler and hardware design suitable for DRFx, and evaluate the performance overhead due to our compiler and hardware requirements.