Scalable Distributed Transaction Processing on Modern RDMA-enabled Networks

Erfan Zamanian

doi:10.1145/3448016.3461469

Abstract

The scalability of existing shared-nothing databases quickly degrades in the presence of distributed transactions, with the network being the main cause. When distributed databases were designed, networks had low bandwidth, high latency, and high CPU overhead per message because of the traditional network stack. As a result, distributed databases were designed to avoid distributed transactions as much as possible. Yet, with the advent of the next generation of high-speed RDMA-enabled networks, it is time to revisit this design mantra of distributed DBMSs. The main focus of my thesis was thus the redesign of distributed transaction processing systems to efficiently leverage modern RDMA-enabled networks. With Remote Direct Memory Access (RDMA), it is possible to bypass the CPU when transferring data from one machine to another. Moreover, the current generation of these networks already provides bandwidth similar to that of main memory. However, our first finding is that simply upgrading the network does not automatically yield scalability without redesigning the underlying distributed database. In the quest for building fast, scalable, and highly available OLTP databases, my thesis revisited the design of distributed DBMSs for RDMA and made the following core contributions to the DBMS community:

1- Do we need a new DBMS architecture? We propose a new abstraction called Network-Attached-Memory (NAM), which fully exploits RDMA for distributed DBMSs. The main idea is to decouple compute and storage to enable independent scalability, and to allow compute nodes to access data on the storage nodes using RDMA operations. With this architecture, physical co-location of compute and storage to improve performance becomes a second-class design consideration rather than a necessity for scalability.

2- Do we need new transaction protocols? Using the NAM architecture, it is possible to design scalable OLTP systems that efficiently leverage low-overhead RDMA. However, doing so requires revisiting existing data structures and transaction protocols, as they were designed in the era of slow networks with high per-message cost. To this end, we presented the design of our novel scalable OLTP engine called NAM-DB. It implements the very common Snapshot Isolation scheme, tailored for RDMA. For example, NAM-DB builds on a scalable timestamp generation algorithm that uses one-sided RDMA operations to decentralize this task. The experiments on NAM-DB show that distributed transactions can indeed scale, without an inherent bottleneck other than those imposed by the workload itself (as I will discuss next).

3- Do we need new partitioning strategies? The primary goal of existing partitioning techniques in distributed DBMSs is to minimize the number of cross-partition transactions, simply because the network used to be the dominant bottleneck. In modern networks, however, we found that the new bottleneck hindering scalability is data contention, while minimizing network communication plays only a subordinate role. To this end, we developed a new solution called Chiller that extends NAM-DB in two directions: (1) a novel commit protocol based on re-ordering transaction operations with the goal of minimizing the lock duration for contended records, and (2) contention-aware partitioning so that the most critical records can be updated without additional coordination.

4- Do we need new high-availability protocols? Finally, we revisited high availability for NAM-DB. As with partitioning, the main goal of existing high-availability approaches is to minimize network overhead, which is no longer a bottleneck with fast RDMA-enabled networks. This also calls for new protocols to fully unleash the potential of RDMA networks for high availability. Hence, as a last contribution, we present a novel strongly consistent replication scheme called Active-Memory. Our proposed primary-backup replication algorithm allows an RDMA-based OLTP system to maintain its high performance in the presence of failures through an efficient RDMA-based undo-logging scheme, achieving much better performance than existing techniques.
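
The re-ordering idea behind Chiller's commit protocol can be illustrated with a small sketch. This is not the thesis implementation: the `Op` type, the contention scores, and both helper functions are hypothetical names introduced here. The sketch assumes that a lock is acquired at a record's first access and held until commit, and that the operations have no data dependencies that constrain their order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One transaction operation (hypothetical type for this sketch)."""
    record: str
    kind: str  # "read" or "write"


def reorder_for_contention(ops, contention):
    """Order operations so the most contended records are touched last.

    `contention` maps a record to an assumed contention score; records
    without a score are treated as uncontended. The sort is stable, so
    operations with equal scores keep their original relative order.
    """
    return sorted(ops, key=lambda op: contention.get(op.record, 0.0))


def lock_hold_steps(ops):
    """Count how many steps each record's lock is held.

    Assumes a lock is taken at the record's first access and released
    only at commit, i.e. after the last operation.
    """
    total = len(ops)
    held = {}
    for position, op in enumerate(ops):
        held.setdefault(op.record, total - position)
    return held


ops = [
    Op("hot_counter", "write"),
    Op("customer_17", "read"),
    Op("order_99", "write"),
]
contention = {"hot_counter": 0.9, "customer_17": 0.1, "order_99": 0.2}

reordered = reorder_for_contention(ops, contention)
print(lock_hold_steps(ops)["hot_counter"])        # lock held for 3 steps
print(lock_hold_steps(reordered)["hot_counter"])  # lock held for 1 step
```

In this toy transaction, moving the write to the contended record to the end shortens its lock duration from three steps to one. A real protocol would also have to respect dependencies between operations and the placement of records across partitions.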

Full Text