Shape-Value Abstraction for Verifying Linearizability
This paper presents a novel abstraction for heap-allocated data structures that keeps track of both their shape and their contents. By combining this abstraction with thread-local analysis and rely-guarantee reasoning, we can verify a collection of fine-grained blocking and non-blocking concurrent algorithms for an arbitrary (unbounded) number of threads. We prove that these algorithms are linearizable, namely equivalent (modulo termination) to their sequential counterparts.
- Research Article
- 10.5075/epfl-thesis-7993
- Jan 1, 2017
- Infoscience (Ecole Polytechnique Fédérale de Lausanne)
The increase in the number of cores in processors has been an important trend over the past decade. To use such architectures efficiently, modern software must be scalable: performance should increase proportionally to the number of allotted cores. While some software is inherently parallel, with threads seldom having to coordinate, a large fraction of software systems are based on shared state, to which access must be coordinated. This shared state generally comes in the form of a concurrent data structure. It is thus essential for these concurrent data structures to be correct, fast and scalable, regardless of the scenario (i.e., different workloads, processors, memory units, programming abstractions). Nevertheless, few or no generic approaches exist that result in concurrent data structures which scale across a large spectrum of environments. This dissertation introduces a set of generic methods that make it possible to build fast and scalable concurrent data structures, irrespective of the deployment environment. We start by identifying a set of sufficient conditions for concurrent search data structures to scale and perform well regardless of the workloads and processors they are running on. We introduce "asynchronized concurrency", a paradigm consisting of four complementary programming patterns, which calls for the design of concurrent search data structures to resemble that of their sequential counterparts. Next, we show that there is virtually no practical situation in which one should seek a "theoretically wait-free" algorithm at the expense of a state-of-the-art blocking algorithm in the case of search data structures: blocking algorithms are simple, fast, and can be made practically wait-free. We then focus on the memory unit, and provide a method yielding fast concurrent data structures even when the memory is non-volatile and structures must be recoverable in case of a transient failure. 
We start by introducing a generic technique that allows us to avoid doing expensive writes to non-volatile memory by using a fast software cache. We also study memory management, and propose a solution tailored to concurrent data structures that uses coarse-grained memory management in order to avoid logging. Moreover, we argue for the use of lock-free algorithms in this non-volatile context, and show how optimizing them lets us avoid expensive logging operations. Together, the techniques we propose enable us to avoid any form of logging in the common case, thus significantly improving concurrent data structure performance when using non-volatile RAM. Finally, we go beyond basic interfaces, and look at scalable partitioned data structures implemented through a transactional interface. We present multiversion timestamp locking (MVTL), a new genre of multiversion concurrency control algorithms for serializable transactions. The key idea behind MVTL is simple and novel: lock individual time points instead of locking objects or versions. We provide several MVTL-based algorithms that address limitations of current concurrency-control schemes. In short, by spanning workloads, processors, storage abstractions, and system sizes, this dissertation takes a step towards concurrent data structures that are universally scalable.
- Conference Article
- 10.1109/clei.2016.7833377
- Oct 1, 2016
This work addresses solving the transport (advection-diffusion) equation in 3D using an explicit scheme for the finite difference method. Our initiative is motivated by the advantages offered by this scheme for parallel processing. We propose three implementations: a sequential code (in C) and two parallel versions (C-CUDA and C with OpenMP). The experimental comparison focuses on the performance of each implementation using different grid sizes and, in the case of the OpenMP implementation, several thread counts. Additionally, we measured the accuracy of this scheme as the discretization is refined. The results show that the parallel implementations reach a significant speedup compared with the sequential counterpart. In addition, the GPU variant offers a further runtime reduction of up to 10x.
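To illustrate why an explicit scheme suits parallel processing, here is a minimal sketch in Python of one explicit time step for a simplified 1D advection-diffusion equation, u_t + c u_x = d u_xx. It is not the 3D C, CUDA or OpenMP code from the paper. The names `step` and `simulate`, the upwind treatment of advection, the periodic boundaries and the stability check are all assumptions made for this illustration.

```python
def step(u, c, d, dx, dt):
    """Advance u by one explicit time step (assumes c >= 0, periodic domain)."""
    if c < 0:
        raise ValueError("this sketch uses upwind differences for c >= 0 only")
    adv = c * dt / dx          # advection (Courant) number
    dif = d * dt / (dx * dx)   # diffusion number
    if adv + 2.0 * dif > 1.0:
        raise ValueError("time step too large for this explicit scheme")
    n = len(u)
    new = [0.0] * n
    # Every new[i] reads only OLD values, so the iterations are independent
    # and could be distributed over threads or GPU cores.
    for i in range(n):
        left = u[i - 1]          # index -1 wraps around (periodic boundary)
        right = u[(i + 1) % n]
        new[i] = u[i] - adv * (u[i] - left) + dif * (right - 2.0 * u[i] + left)
    return new


def simulate(u0, c, d, dx, dt, steps):
    """Apply `step` repeatedly, starting from the initial profile u0."""
    u = list(u0)
    for _ in range(steps):
        u = step(u, c, d, dx, dt)
    return u


if __name__ == "__main__":
    pulse = [0.0] * 20
    pulse[5] = 1.0
    print(simulate(pulse, c=1.0, d=0.1, dx=1.0, dt=0.5, steps=10))
```

Because each updated cell depends only on values from the previous time level, the inner loop has no loop-carried dependency. That independence is the property that parallel implementations of explicit schemes typically exploit.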
- Conference Article
5
- 10.1109/apeie.2018.8545197
- Oct 1, 2018
This work proposes an implementation of a scalable concurrent pool based on diffraction trees. The developed pool ensures localization of addresses to shared variables to maximize its throughput. The proposed approaches increase throughput at both high and low workloads, provide an acceptable level of FIFO/LIFO ordering of operation execution, and are characterized by low tree-traversal latency. We analyze the efficiency of the developed pool. The pool provides greater scalability for multithreaded programs than similar pool implementations based on diffraction trees. The developed pools may be applied to implement the producer-consumer model in multithreaded programs that have a constant number of active threads and require high pool throughput and low operation latency. The implemented data structure scales well for large numbers of threads and shows increasing throughput as the number of threads approaches the number of processor cores. Increasing the tree size in the pool does not reduce the pool throughput. Recommendations for using the pool and experimental results on a multicore computer system are presented in the paper.
- Research Article
22
- 10.1177/0165551513519212
- Jan 13, 2014
- Journal of Information Science
Heuristic search is used in many problems and applications, such as the 15 puzzle problem, the travelling salesman problem and web search engines. In this paper, the A* heuristic search algorithm is reconsidered by proposing a parallel generic approach based on multithreading for solving the 15 puzzle problem. Using multithreading, sequential computers are provided with virtual parallelization, yielding faster execution and easy communication. These advantageous features are provided through creating a dynamic number of concurrent threads at the run time of an application. The proposed approach is evaluated analytically and experimentally and compared with its sequential counterpart in terms of various performance metrics. It is revealed by the experimental results that multithreading is a viable approach for parallel A* heuristic search. For instance, it has been found that the parallel multithreaded A* heuristic search algorithm, in particular, outperforms the sequential approach in terms of time complexity and speedup.
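As a point of reference for the sequential counterpart mentioned above, the following is a minimal single-threaded A* sketch in Python for the sliding-tile puzzle. It does not reproduce the paper's multithreaded approach. The function names, the Manhattan-distance heuristic and the goal layout (tiles 1 to 15 in row order with the blank, written 0, in the last cell) are assumptions chosen for illustration.

```python
import heapq


def manhattan(state, size):
    """Sum of the tiles' grid distances from their goal cells (blank ignored)."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal = tile - 1
        total += abs(idx // size - goal // size) + abs(idx % size - goal % size)
    return total


def neighbours(state, size):
    """Yield every state reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, size)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < size and 0 <= c < size:
            target = r * size + c
            nxt = list(state)
            nxt[blank], nxt[target] = nxt[target], nxt[blank]
            yield tuple(nxt)


def astar(start, size=4):
    """Return the length of a shortest solution, or None if none is found."""
    goal = tuple(list(range(1, size * size)) + [0])
    best_g = {start: 0}
    frontier = [(manhattan(start, size), 0, start)]   # entries are (f, g, state)
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state == goal:
            return g
        if g > best_g.get(state, float("inf")):
            continue                                   # stale queue entry
        for nxt in neighbours(state, size):
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + manhattan(nxt, size), ng, nxt))
    return None
```

A parallel variant would have to share or partition the `frontier` and `best_g` structures among threads. How the paper does so is not stated in the abstract.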
- Book Chapter
1
- 10.1007/978-3-030-87010-2_23
- Jan 1, 2021
This article analyzes the efficiency of the developed CW-tree data structure in comparison with the B+-tree. The B+-tree is used in the popular MySQL relational database management system. It has been experimentally proven that the B+-tree is not efficient for parallel data retrieval. A study of parallelizing queries with different numbers of threads showed that the search speed increases with the number of threads. However, when the number of threads is ≥4, the speed stops changing, so there is no point in using more than four threads. For testing the CW-tree, a separate physical drive connected via a PCI-Express interface was used. The drive is an INTEL MEMPEK1W016GA with a capacity of 13.41 gigabytes, a logical sector size of 512 bytes and a physical sector size of 512 bytes. A database was created on this Intel drive and filled in according to the CW-tree data structure. For the analysis of the CW-tree, 6 search queries were developed with different amounts of returned data. The experiment showed that executing these queries in parallel mode is faster with the CW-tree than executing the same queries in MySQL, where the B+-tree is used to index data.
- Book Chapter
134
- 10.1007/978-3-642-37036-6_29
- Jan 1, 2013
We present algorithms for checking and enforcing robustness of concurrent programs against the Total Store Ordering (TSO) memory model. A program is robust if all its TSO computations correspond to computations under the Sequential Consistency (SC) semantics. We provide a complete characterization of non-robustness in terms of so-called attacks: a restricted form of (harmful) out-of-program-order executions. Then, we show that detecting attacks can be parallelized, and can be solved using state reachability queries under the SC semantics in a suitably instrumented program obtained by a linear size source-to-source translation. Importantly, the construction is valid for an unbounded number of memory addresses and an arbitrary number of parallel threads. It is independent from the data domain and from the size of store buffers in the TSO semantics. In particular, when the data domain is finite and the number of addresses is fixed, we obtain decidability and complexity results for robustness, even for a parametric number of threads. As a second contribution, we provide an algorithm for computing an optimal set of fences that enforce robustness. We consider two criteria of optimality: minimization of program size and maximization of its performance. The algorithms we define are implemented, and we successfully applied them to analyzing and correcting several concurrent algorithms.
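The notion of non-robustness can be illustrated with the classic store-buffering litmus test. The sketch below is written in Python and is not the attack-detection algorithm from the paper. It exhaustively enumerates schedules of a tiny program, once with stores written straight to memory (SC) and once with per-thread FIFO store buffers (a simplified TSO model). The instruction encoding and the function name `outcomes` are assumptions made for this illustration.

```python
def outcomes(programs, buffered):
    """Return the set of final register valuations over all schedules.

    programs: one instruction list per thread, with instructions
              ("store", var, value) or ("load", var, register).
    buffered: False models SC; True gives each thread a FIFO store buffer.
    """
    n = len(programs)
    start = (tuple([0] * n), tuple(() for _ in range(n)), (), ())
    stack, seen, results = [start], {start}, set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        memory = dict(mem)
        successors = []
        for t in range(n):
            if bufs[t]:
                # Flush the oldest buffered store of thread t to memory.
                var, val = bufs[t][0]
                new_mem = dict(memory)
                new_mem[var] = val
                new_bufs = bufs[:t] + (bufs[t][1:],) + bufs[t + 1:]
                successors.append(
                    (pcs, new_bufs, tuple(sorted(new_mem.items())), regs))
            if pcs[t] < len(programs[t]):
                op = programs[t][pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op[0] == "store" and buffered:
                    new_bufs = (bufs[:t] + (bufs[t] + ((op[1], op[2]),),)
                                + bufs[t + 1:])
                    successors.append((new_pcs, new_bufs, mem, regs))
                elif op[0] == "store":
                    new_mem = dict(memory)
                    new_mem[op[1]] = op[2]
                    successors.append(
                        (new_pcs, bufs, tuple(sorted(new_mem.items())), regs))
                else:
                    # A load sees the thread's newest buffered store, if any.
                    value = memory.get(op[1], 0)
                    for bvar, bval in bufs[t]:
                        if bvar == op[1]:
                            value = bval
                    new_regs = dict(regs)
                    new_regs[op[2]] = value
                    successors.append(
                        (new_pcs, bufs, mem, tuple(sorted(new_regs.items()))))
        if not successors:          # all threads done and all buffers empty
            results.add(regs)
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return results


# Store buffering: each thread writes one variable, then reads the other.
SB = [[("store", "x", 1), ("load", "y", "r0")],
      [("store", "y", 1), ("load", "x", "r1")]]
```

For `SB`, the outcome in which both registers read 0 is reachable only when `buffered=True`, so this program has a TSO computation with no SC counterpart and is therefore not robust in the sense defined above.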
- Dissertation
- 10.26686/wgtn.17060108
- Jan 1, 2017
This thesis explores two kinds of program logics that have become important for modern program verification - separation logic, for reasoning about programs that use pointers to build mutable data structures, and rely guarantee reasoning, for reasoning about shared variable concurrent programs. We look more closely into the motivations for merging these two kinds of logics into a single formalism that exploits the benefits of both approaches - local, modular, and explicit reasoning about interference between threads in a shared memory concurrent program. We discuss in detail two such formalisms - RGSep and Local Rely Guarantee (LRG). In particular, we analyse how each formalism models program state and treats the distinction between global state (shared by all threads) and local state (private to a given thread), and how each logic models actions performed by threads on shared state, and we look into the proof rules specifically for reasoning about atomic blocks of code. We present full examples of proofs in each logic and discuss their differences. This thesis also illustrates how a weakest precondition semantics for separation logic can be used to carry out calculational proofs. We also note how in essence these proofs are data abstraction proofs showing that a data structure implements some abstract data type, and relate this idea to a classic data abstraction technique by Hoare. Finally, as part of the thesis we also present a survey of tools that are currently available for doing manual or semi-automated proofs as well as program analyses with separation logic and rely guarantee.
- Dissertation
- 10.26686/wgtn.17060108.v1
- Jan 1, 2017
This thesis explores two kinds of program logics that have become important for modern program verification - separation logic, for reasoning about programs that use pointers to build mutable data structures, and rely guarantee reasoning, for reasoning about shared variable concurrent programs. We look more closely into the motivations for merging these two kinds of logics into a single formalism that exploits the benefits of both approaches - local, modular, and explicit reasoning about interference between threads in a shared memory concurrent program. We discuss in detail two such formalisms - RGSep and Local Rely Guarantee (LRG). In particular, we analyse how each formalism models program state and treats the distinction between global state (shared by all threads) and local state (private to a given thread), and how each logic models actions performed by threads on shared state, and we look into the proof rules specifically for reasoning about atomic blocks of code. We present full examples of proofs in each logic and discuss their differences. This thesis also illustrates how a weakest precondition semantics for separation logic can be used to carry out calculational proofs. We also note how in essence these proofs are data abstraction proofs showing that a data structure implements some abstract data type, and relate this idea to a classic data abstraction technique by Hoare. Finally, as part of the thesis we also present a survey of tools that are currently available for doing manual or semi-automated proofs as well as program analyses with separation logic and rely guarantee.
- Book Chapter
35
- 10.1007/978-3-662-54434-1_24
- Jan 1, 2017
Linearizability is the commonly accepted notion of correctness for concurrent data structures. It requires that any execution of the data structure is justified by a linearization—a linear order on operations satisfying the data structure’s sequential specification. Proving linearizability is often challenging because an operation’s position in the linearization order may depend on future operations. This makes it very difficult to incrementally construct the linearization in a proof. We propose a new proof method that can handle data structures with such future-dependent linearizations. Our key idea is to incrementally construct not a single linear order of operations, but a partial order that describes multiple linearizations satisfying the sequential specification. This allows decisions about the ordering of operations to be delayed, mirroring the behaviour of data structure implementations. We formalise our method as a program logic based on rely-guarantee reasoning, and demonstrate its effectiveness by verifying several challenging data structures: the Herlihy-Wing queue, the TS queue and the Optimistic set.
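To make the definition concrete, here is a naive brute-force linearizability check for a FIFO queue, sketched in Python. It is not the partial-order proof method proposed in the paper. It simply tries every linear order of a small, complete history and accepts if one of them respects real-time order and the sequential queue specification. The history encoding (dictionaries with `name`, `arg`, `result`, `start` and `end` keys) and the function names are assumptions made for this illustration.

```python
from collections import deque
from itertools import permutations


def respects_real_time(order):
    """An operation that finished before another started must precede it."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later["end"] < earlier["start"]:
                return False
    return True


def legal_queue_run(order):
    """Replay the operations on a sequential FIFO queue and compare results."""
    queue = deque()
    for op in order:
        if op["name"] == "enq":
            queue.append(op["arg"])
        else:  # "deq"; an empty queue is modelled as returning None
            expected = queue.popleft() if queue else None
            if op["result"] != expected:
                return False
    return True


def is_linearizable(history):
    """True if some linear order is both real-time consistent and legal."""
    return any(respects_real_time(order) and legal_queue_run(order)
               for order in permutations(history))


# Two overlapping enqueues followed by a dequeue that returns 2.
overlapping = [
    {"name": "enq", "arg": 1, "result": None, "start": 0, "end": 3},
    {"name": "enq", "arg": 2, "result": None, "start": 1, "end": 2},
    {"name": "deq", "arg": None, "result": 2, "start": 4, "end": 5},
]
```

The search is factorial in the number of operations, so it is only usable on toy histories. It also shows why the order of the overlapping enqueues may be left undecided until the dequeue reveals it, which is the kind of delayed decision a partial order can represent.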
- Research Article
2
- 10.1145/3016078.2851196
- Feb 27, 2016
- ACM SIGPLAN Notices
Concurrent data structures synchronized with locks do not scale well with the number of threads. As more scalable alternatives, concurrent data structures and algorithms based on widely available, however advanced, atomic operations have been proposed. These data structures allow for correct and concurrent operations without any locks. In this paper, we present a new fully lock-free open addressed hash table with a simpler design than prior published work. We split hash table insertions into two atomic phases: first inserting a value ignoring other concurrent operations, then in the second phase resolve any duplicate or conflicting values. Our hash table has a constant and low memory usage that is less than existing lock-free hash tables at a fill level of 33% and above. The hash table exhibits good cache locality. Compared to prior art, our hash table results in 16% and 15% fewer L1 and L2 cache misses respectively, leading to 21% fewer memory stall cycles. Our experiments show that our hash table scales close to linearly with the number of threads and outperforms, in throughput, other lock-free hash tables by 19%.
- Conference Article
4
- 10.1145/2851141.2851196
- Feb 27, 2016
Concurrent data structures synchronized with locks do not scale well with the number of threads. As more scalable alternatives, concurrent data structures and algorithms based on widely available, however advanced, atomic operations have been proposed. These data structures allow for correct and concurrent operations without any locks. In this paper, we present a new fully lock-free open addressed hash table with a simpler design than prior published work. We split hash table insertions into two atomic phases: first inserting a value ignoring other concurrent operations, then in the second phase resolve any duplicate or conflicting values. Our hash table has a constant and low memory usage that is less than existing lock-free hash tables at a fill level of 33% and above. The hash table exhibits good cache locality. Compared to prior art, our hash table results in 16% and 15% fewer L1 and L2 cache misses respectively, leading to 21% fewer memory stall cycles. Our experiments show that our hash table scales close to linearly with the number of threads and outperforms, in throughput, other lock-free hash tables by 19%.
- Research Article
29
- 10.1145/1925844.1926415
- Jan 26, 2011
- ACM SIGPLAN Notices
Fine-grained concurrent data structures are crucial for gaining performance from multiprocessing, but their design is a subtle art. Recent literature has made large strides in verifying these data structures, using either atomicity refinement or separation logic with rely-guarantee reasoning. In this paper we show how the ownership discipline of separation logic can be used to enable atomicity refinement, and we develop a new rely-guarantee method that is localized to the definition of a data structure. We present the first semantics of separation logic that is sensitive to atomicity, and show how to control this sensitivity through ownership. The result is a logic that enables compositional reasoning about atomicity and interference, even for programs that use fine-grained synchronization and dynamic memory allocation.
- Conference Article
31
- 10.1145/1926385.1926415
- Jan 26, 2011
Fine-grained concurrent data structures are crucial for gaining performance from multiprocessing, but their design is a subtle art. Recent literature has made large strides in verifying these data structures, using either atomicity refinement or separation logic with rely-guarantee reasoning. In this paper we show how the ownership discipline of separation logic can be used to enable atomicity refinement, and we develop a new rely-guarantee method that is localized to the definition of a data structure. We present the first semantics of separation logic that is sensitive to atomicity, and show how to control this sensitivity through ownership. The result is a logic that enables compositional reasoning about atomicity and interference, even for programs that use fine-grained synchronization and dynamic memory allocation.
- Conference Article
1
- 10.1145/2594291.2594344
- Jun 9, 2014
The aim of AEMINIUM is to study the implications of having a concurrent-by-default programming language. This includes language design, runtime system, performance and software engineering considerations. We conduct our study through the design of the concurrent-by-default AEMINIUM programming language. AEMINIUM leverages the permission flow of object and group permissions through the program to validate the program's correctness and to automatically infer a possible parallelization strategy via a dataflow graph. AEMINIUM supports not only fork-join parallelism but more general dataflow patterns of parallelism. In this paper we present a formal system, called μAEMINIUM, modeling the core concepts of AEMINIUM. μAEMINIUM's static type system is based on Featherweight Java with AEMINIUM-specific extensions. Besides checking for correctness, AEMINIUM's type system also uses the permission flow to compute a potential parallel execution strategy for the program. μAEMINIUM's dynamic semantics use a concurrent-by-default evaluation approach. Along with the formal system we present its soundness proof. We provide a full description of the implementation along with the description of various optimization techniques we used. We implemented AEMINIUM as an extension of the Plaid programming language, which has first-class support for permissions built-in. The AEMINIUM implementation and all case studies are publicly available under the General Public License. We use various case studies to evaluate AEMINIUM's applicability and to demonstrate that AEMINIUM parallelized code has performance improvements compared to its sequential counterpart. We chose to use case studies from common domains or problems that are known to benefit from parallelization, to show that AEMINIUM is powerful enough to encode them. 
We demonstrate through a webserver application, which evaluates AEMINIUM's impact on latency-bound applications, that AEMINIUM can achieve a 70% performance improvement over the sequential counterpart. In another case study we chose to implement a dictionary function to evaluate AEMINIUM's capabilities to express essential data structures. Our evaluation demonstrates that AEMINIUM can be used to express parallelism in such data structures and that the performance benefits scale with the amount of annotation effort which is put into the implementation. We chose an integral computation example to evaluate pure functional programming and computationally intensive use cases. Our experiments show that AEMINIUM is capable of extracting parallelism from functional code and achieving performance improvements up to the limits of Plaid's inherent performance bounds. Overall, we hope that the work helps to advance concurrent programming in modern programming environments.
- Research Article
- 10.1145/2666356.2594344
- Jun 5, 2014
- ACM SIGPLAN Notices
The aim of ÆMINIUM is to study the implications of having a concurrent-by-default programming language. This includes language design, runtime system, performance and software engineering considerations. We conduct our study through the design of the concurrent-by-default ÆMINIUM programming language. ÆMINIUM leverages the permission flow of object and group permissions through the program to validate the program's correctness and to automatically infer a possible parallelization strategy via a dataflow graph. ÆMINIUM supports not only fork-join parallelism but more general dataflow patterns of parallelism. In this paper we present a formal system, called μÆMINIUM, modeling the core concepts of ÆMINIUM. μÆMINIUM's static type system is based on Featherweight Java with ÆMINIUM-specific extensions. Besides checking for correctness, ÆMINIUM's type system also uses the permission flow to compute a potential parallel execution strategy for the program. μÆMINIUM's dynamic semantics use a concurrent-by-default evaluation approach. Along with the formal system we present its soundness proof. We provide a full description of the implementation along with the description of various optimization techniques we used. We implemented ÆMINIUM as an extension of the Plaid programming language, which has first-class support for permissions built-in. The ÆMINIUM implementation and all case studies are publicly available under the General Public License. We use various case studies to evaluate ÆMINIUM's applicability and to demonstrate that ÆMINIUM parallelized code has performance improvements compared to its sequential counterpart. We chose to use case studies from common domains or problems that are known to benefit from parallelization, to show that ÆMINIUM is powerful enough to encode them. 
We demonstrate through a webserver application, which evaluates ÆMINIUM's impact on latency-bound applications, that ÆMINIUM can achieve a 70% performance improvement over the sequential counterpart. In another case study we chose to implement a dictionary function to evaluate ÆMINIUM's capabilities to express essential data structures. Our evaluation demonstrates that ÆMINIUM can be used to express parallelism in such data structures and that the performance benefits scale with the amount of annotation effort which is put into the implementation. We chose an integral computation example to evaluate pure functional programming and computationally intensive use cases. Our experiments show that ÆMINIUM is capable of extracting parallelism from functional code and achieving performance improvements up to the limits of Plaid's inherent performance bounds. Overall, we hope that the work helps to advance concurrent programming in modern programming environments.