Power-optimized Deployment of Key-value Stores Using Storage Class Memory

Abstract

High-performance flash-based key-value stores in data centers rely on large amounts of DRAM to cache hot data. However, motivated by the high cost and power consumption of DRAM, server designs with a lower DRAM-per-compute ratio are becoming popular. These low-cost servers enable scale-out services by reducing server workload densities, which improves overall service reliability and decreases the total cost of ownership (TCO) for scalable workloads. Nevertheless, for key-value stores with large memory footprints, reduced-DRAM servers degrade performance because both IO utilization and data access latency increase. In this scenario, the standard practice for improving the performance of sharded databases is to reduce the number of shards per machine, which erodes the TCO benefits of reduced-DRAM low-cost servers. In this work, we explore a practical solution to improve performance and reduce the cost and power consumption of key-value stores running on DRAM-constrained servers by using Storage Class Memory (SCM). SCM in a DIMM form factor, although slower than DRAM, is sufficiently faster than flash to serve as a large extension to DRAM. With new technologies such as Compute Express Link, we can expand the memory capacity of servers with high-bandwidth, low-latency connectivity to SCM. In this article, we use Intel Optane PMem 100 Series SCM (DCPMM) in AppDirect mode to extend the available memory of our existing single-socket platform deployment of RocksDB (one of the largest key-value stores at Meta). We first designed a hybrid cache in RocksDB to harness both DRAM and SCM hierarchically. We then characterized the performance of the hybrid cache for three of the largest RocksDB use cases at Meta (ChatApp, BLOB Metadata, and Hive Cache).
Our results demonstrate that we can achieve up to 80% improvement in throughput and 20% improvement in P95 latency over the existing small DRAM single-socket platform, while maintaining a 43–48% cost improvement over our large DRAM dual-socket platform. To the best of our knowledge, this is the first study of the DCPMM platform in a commercial data center.
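The "hybrid cache" the abstract describes layers a small, fast DRAM tier over a larger, slower SCM tier. As a rough illustration of that idea (this is not Meta's actual RocksDB implementation; the class name, LRU policy, and capacities are assumptions for the sketch), DRAM evictions can be demoted into SCM, and SCM hits promoted back:

```python
from collections import OrderedDict

class HybridCache:
    """Two-tier cache sketch: a small 'DRAM' tier in front of a larger
    'SCM' tier. Blocks evicted from DRAM are demoted to SCM rather than
    discarded; SCM hits are promoted back into DRAM."""

    def __init__(self, dram_capacity, scm_capacity):
        self.dram = OrderedDict()  # fast tier; least-recently-used entry first
        self.scm = OrderedDict()   # slower, larger tier
        self.dram_cap = dram_capacity
        self.scm_cap = scm_capacity

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)   # refresh LRU position
            return self.dram[key]
        if key in self.scm:
            value = self.scm.pop(key)    # promote hot block back to DRAM
            self.put(key, value)
            return value
        return None                      # full miss: caller reads from flash

    def put(self, key, value):
        self.dram[key] = value
        self.dram.move_to_end(key)
        if len(self.dram) > self.dram_cap:
            old_key, old_val = self.dram.popitem(last=False)
            self.scm[old_key] = old_val  # demote instead of discarding
            if len(self.scm) > self.scm_cap:
                self.scm.popitem(last=False)  # finally evict from SCM

cache = HybridCache(dram_capacity=2, scm_capacity=4)
cache.put('a', 1)
cache.put('b', 2)
cache.put('c', 3)  # 'a' is now least recently used: demoted from DRAM to SCM
```

A subsequent `cache.get('a')` hits the SCM tier and promotes the block back into DRAM, which is the hierarchical behavior the paper's design exploits for hot data.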

Similar Papers
  • Book Chapter
  • Cited by 1
  • 10.1007/978-3-319-68210-5_1
SCMKV: A Lightweight Log-Structured Key-Value Store on SCM
  • Jan 1, 2017
  • Zhenjie Wang + 2 more

Storage Class Memories (SCMs) are promising technologies that could change the future of storage, with many attractive capabilities such as byte addressability, low latency, and persistence. Existing key-value stores designed for block devices treat SCMs as block devices, which conceals the performance SCMs can provide. The few existing key-value stores designed for SCMs fail to provide consistency when hardware support, such as cache flushing on power failure, is unavailable. In this paper, we present a key-value store called SCMKV that provides consistency, performance, and scalability. It takes advantage of the characteristics of key-value workloads and leverages the log-structured technique for high throughput. In particular, we propose a static, concurrent, cache-friendly hash table to accelerate access to key-value objects, and maintain separate data logs and memory allocators for each worker thread to achieve high concurrency. To reduce write latency, SCMKV minimizes writes to SCM and cache-flush instructions. Our experiments show that SCMKV achieves much higher throughput and better scalability than state-of-the-art key-value stores.

  • Conference Article
  • Cited by 10
  • 10.1145/2928275.2933273
Using Storage Class Memory Efficiently for an In-memory Database
  • Jun 6, 2016
  • Yonatan Gottesman + 4 more

Storage class memory (SCM) is an emerging class of memory devices that are both byte addressable and persistent. There are many different technologies that can be considered SCM, at different stages of maturity; examples include NVDIMM-N, PCM, SttRAM, Racetrack, FeRAM, and others. Currently, applications rely on storage technologies such as Flash memory and hard disks as the physical media for persistency. SCM behaves differently and has significantly different characteristics than these existing technologies. That means there is a need for a fundamental change in the way we program data persistency in applications to fully realize the potential of this new class of device. Previous work such as [1] focuses on designing a filesystem optimized to run over SCM memory. Other projects such as Mnemosyne [4] provide a general-purpose API for applications to use SCM memory. Mnemosyne, however, does not provide a transaction mechanism flexible enough to allow different SCM updates spanning different parts of the code to be considered one transaction. Our work is focused on employing the minimal changes needed to retrofit an existing key-value store to take advantage of SCM technology. We demonstrate these changes on Redis (REmote DIctionary Server) [3], a popular key-value store. We show how Redis can be modified to take advantage of these new abilities by allowing the application to manage its own storage in a unique way. Our approach is to use two types of memory technology (DRAM and SCM) for different purposes in a single application. To optimize the system's data capacity, we keep a minimal dataset in persistent memory, while keeping metadata (such as indexing) in DRAM, which can be rebuilt upon failure. Persistency in Redis is currently achieved by logging all transactions to an append-only log file (AOF). These transactions can then be replayed to recover from a failure. The transactions are not made persistent until the AOF file is flushed to disk, which is very slow. Flushing after every transaction has a performance impact, but flushing periodically creates a risk of lost data. By using SCM instead of a disk, we can effectively flush every transaction without impacting performance. In order to change Redis to store data objects on the SCM, we must ensure consistency of persistent data even after an unexpected shutdown. To ensure consistency, a modified version of dlmalloc [2] is used for all allocations done on the SCM, and mfence commands are used to overcome unexpected reordering. We model the SCM using a memory-mapped file backed on a ramdisk, and compare our changes to Redis using an AOF backed on a ramdisk. Although we don't take into account the latency overheads of accessing the SCM, this comparison gives us a good upper bound on the performance benefits of using SCM memory. Our results demonstrate an average latency reduction of 43% and an average throughput increase of 75%.
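The AOF trade-off described above (durability requires flushing every transaction, which is slow on disk) can be sketched with a minimal append-only log. This is an illustrative stand-in, not Redis's actual AOF code; on SCM, the costly `fsync` step would be replaced by cheap cache-line flushes and fences:

```python
import os
import tempfile

class AppendOnlyLog:
    """Minimal AOF-style log: each committed record is flushed before
    commit() returns, mimicking per-transaction durability. On disk the
    fsync dominates; on SCM the equivalent flush is nearly free."""

    def __init__(self, path):
        self.f = open(path, 'ab')

    def commit(self, record: bytes):
        self.f.write(record + b'\n')
        self.f.flush()               # drain user-space buffers
        os.fsync(self.f.fileno())    # force to stable media (slow on disk)

    def replay(self):
        """Re-read all committed records, as done during crash recovery."""
        self.f.flush()
        with open(self.f.name, 'rb') as r:
            return [line.rstrip(b'\n') for line in r]

path = os.path.join(tempfile.mkdtemp(), 'appendonly.aof')
log = AppendOnlyLog(path)
log.commit(b'SET k1 v1')
log.commit(b'SET k2 v2')
recovered = log.replay()
```

Flushing per commit, as above, gives the no-data-loss guarantee; the paper's point is that SCM makes this affordable where a disk does not.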

  • Conference Article
  • Cited by 2
  • 10.1109/icisce48695.2019.00046
Challenges and Implications of Memory Management Systems under Fast SCM Storage
  • Dec 1, 2019
  • Yunjoo Park + 2 more

Recently, Storage-Class Memory (SCM) has advanced as a new memory/storage medium, and legacy memory subsystems optimized for DRAM-HDD architectures need to be redesigned. In this paper, we revisit memory subsystems that use SCM as the underlying storage device and discuss the challenges and implications of such systems. Specifically, we analyze two memory layers influenced by fast storage devices: the buffer cache and the paging system. For the buffer cache, our analysis shows that caching a file block pays off only when the block from SCM storage is accessed at least twice after entering the cache. This contrasts with the HDD case, in which even a single access from the cache is beneficial. For paging systems, we found that a small page is effective in improving data access latency, even though it does not improve the page fault ratio. However, we further observed that a small page degrades the TLB miss ratio, which in turn worsens address translation latency. Thus, under SCM storage, an appropriate page size must be chosen by considering the trade-off between address translation and data access latency. We anticipate that the results of this paper will be helpful in designing memory subsystems for ever faster SCM storage devices.
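The "at least twice" observation can be made concrete with a small break-even model: inserting a block into the DRAM cache costs some fixed overhead, and each subsequent hit saves only the gap between storage latency and DRAM latency. The cost model and all timing numbers below are illustrative assumptions, not measurements from the paper:

```python
import math

def min_profitable_hits(t_storage_us, t_dram_us, insert_cost_us):
    """Smallest number of cache hits, after a block enters the cache, for
    caching to pay off. Each hit saves (t_storage - t_dram); inserting the
    block costs insert_cost. Caching gains when hits * saving > insert_cost."""
    saving_per_hit = t_storage_us - t_dram_us
    return math.floor(insert_cost_us / saving_per_hit) + 1

# Illustrative (assumed) timings, in microseconds:
hdd_hits = min_profitable_hits(5000.0, 0.1, 1.5)  # HDD-backed storage
scm_hits = min_profitable_hits(1.0, 0.1, 1.5)     # SCM-backed storage
```

With a slow HDD behind the cache, the per-hit saving dwarfs the insertion cost, so a single re-access already pays off; with SCM, the saving per hit shrinks toward the insertion overhead, so more than one re-access is needed, matching the paper's qualitative conclusion.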

  • Conference Article
  • Cited by 64
  • 10.1145/2527792.2527799
Exploring storage class memory with key value stores
  • Nov 3, 2013
  • Katelin A Bailey + 4 more

In the near future, new storage-class memory (SCM) technologies -- such as phase-change memory and memristors -- will radically change the nature of long-term storage. These devices will be cheap, non-volatile, byte addressable, and near DRAM density and speed. While SCM offers enormous opportunities, profiting from them will require new storage systems specifically designed for SCM's properties. This paper presents Echo, a persistent key-value storage system designed to leverage the advantages and address the challenges of SCM. The goals of Echo include high performance for both small and large data objects, recoverability after failure, and scalability on multicore systems. Echo achieves its goals through the use of a two-level memory design targeted for memory systems containing both DRAM and SCM, exploitation of SCM's byte addressability for fine-grained transactions in non-volatile memory, and the use of snapshot isolation for concurrency, consistency, and versioning. Our evaluation demonstrates that Echo's SCM-centric design achieves the durability guarantees of the best disk-based stores with performance characteristics approaching those of the best in-memory key-value stores.

  • Book Chapter
  • Cited by 7
  • 10.1007/978-3-642-18206-8_7
Impact of Recent Hardware and Software Trends on High Performance Transaction Processing and Analytics
  • Jan 1, 2011
  • C Mohan

In this paper, I survey briefly some of the recent and emerging trends in hardware and software features which impact high performance transaction processing and data analytics applications. These features include multicore processor chips, ultra large main memories, flash storage, storage class memories, database appliances, field programmable gate arrays, transactional memory, key-value stores, and cloud computing. While some applications, e.g., Web 2.0 ones, were initially built without traditional transaction processing functionality in mind, slowly system architects and designers are beginning to address such previously ignored issues. The availability, analytics and response time requirements of these applications were initially given more importance than ACID transaction semantics and resource consumption characteristics. A project at IBM Almaden is studying the implications of phase change memory on transaction processing, in the context of a key-value store. Bitemporal data management has also become an important requirement, especially for financial applications. Power consumption and heat dissipation properties are also major considerations in the emergence of modern software and hardware architectural features. Considerations relating to ease of configuration, installation, maintenance and monitoring, and improvement of total cost of ownership have resulted in database appliances becoming very popular. The MapReduce paradigm is now quite popular for large scale data analysis, in spite of the major inefficiencies associated with it.

Keywords: Analytics, Appliances, Cloud Computing, Databases, FPGAs, Hardware, Key-Value Stores, Multicore, Performance, Software, Storage Class Memories, Transaction Processing

  • Conference Article
  • Cited by 1
  • 10.1145/3349341.3349469
Design and Implementation of SCM and SSD based Hybrid Key-Value Store
  • Jul 12, 2019
  • Ling Zhan + 2 more

Storage Class Memory (also called Non-volatile Memory) has many advantages, such as high performance and byte addressability, which can benefit the design of storage systems. To combine the merits of SCM and SSD, we design SSHKV (SCM and SSD Hybrid Key-Value Store), a hybrid and efficient key-value storage system. SSHKV stores keys and metadata in SCM and stores values on the SSD in a log-structured format to balance performance and capacity. In addition, we propose a strategy called logical space amplification to reduce valid-data migration during garbage collection (GC): the TRIM instruction releases invalid pages in physical space, which are then remapped to new logical space. In our tests, the random write throughput of SSHKV is about 6.8× better than that of LevelDB, a currently popular key-value store engine based on the LSM-Tree. Tests under different logical space amplification factors also show that amplifying the logical space can effectively improve system performance.
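The SCM/SSD split described above (keys and metadata in byte-addressable memory, values in a log on flash) can be sketched as follows. This is a minimal illustration, not SSHKV's implementation: a Python dict stands in for the SCM-resident index, an ordinary file stands in for the SSD value log, and GC/TRIM handling is omitted:

```python
import os
import tempfile

class SplitKVStore:
    """Sketch of an SCM/SSD split: keys and metadata live in a fast
    byte-addressable tier (a dict standing in for SCM); values are appended
    to a log file standing in for the SSD."""

    def __init__(self, log_path):
        self.index = {}                   # key -> (offset, length), "SCM" tier
        self.log = open(log_path, 'w+b')  # append-only value log, "SSD" tier

    def put(self, key, value: bytes):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)             # sequential write: flash-friendly
        self.index[key] = (offset, len(value))

    def get(self, key):
        if key not in self.index:
            return None                   # index miss needs no SSD access
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

store = SplitKVStore(os.path.join(tempfile.mkdtemp(), 'values.log'))
store.put('k1', b'hello')
store.put('k2', b'world!')
```

The design choice this illustrates is that lookups for absent keys, and all metadata operations, never touch the SSD; only value reads and appends do.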

  • Conference Article
  • Cited by 9
  • 10.1109/imw48823.2020.9108145
Emerging Usage and Evaluation of Low Latency FLASH
  • May 1, 2020
  • Tatsuo Shiozawa + 3 more

Storage class memory (SCM) is expected to fill the gap between DRAM and flash SSD storage. In this paper, we introduce XL-FLASH™, a cost-effective flash-based SCM, and the XL-FLASH demo drive, featuring a low-penalty, low-latency DMA control interface. We show evaluation results demonstrating equivalent performance between an in-memory database and the proposed key-value store database using the XL-FLASH demo drive. This result demonstrates the possibility of replacing DRAM with XL-FLASH in a key-value store database application for highly concurrent, read-intensive use cases.

  • Research Article
  • Cited by 12
  • 10.1109/access.2018.2873579
DStore: A Holistic Key-Value Store Exploring Near-Data Processing and On-Demand Scheduling for Compaction Optimization
  • Jan 1, 2018
  • IEEE Access
  • Hui Sun + 4 more

Log-structured merge tree (LSM-tree)-based key-value stores are widely deployed in large-scale storage systems, since traditional relational databases cannot reach the performance required by big-data applications. As high-throughput alternatives to relational databases, LSM-tree-based key-value stores support high-throughput write operations and provide high sequential bandwidth in storage systems. However, the compaction process triggers write amplification and degrades write performance, especially under update-intensive workloads. To address this issue, we design DStore, a holistic key-value store that explores near-data processing (NDP) and on-demand scheduling for compaction optimization in an LSM-tree key-value store. DStore makes full use of the computing capacities of both the host-side and device-side subsystems: it dynamically divides the host-side compaction tasks between the two subsystems according to their different computing capabilities, with the device providing an NDP model. The divided compaction tasks are performed by the host and the device in parallel. In DStore, the NDP-based devices exhibit low-latency, high-bandwidth performance, thus facilitating key-value stores. DStore not only accomplishes compaction for key-value stores but also improves system performance. We implement a DStore prototype on a real-world platform and employ different kinds of testbeds in our experiments, comparing DStore against LevelDB and against a static compaction optimization using the NDP model (called Co-KV). Results show that DStore achieves about a 3.7× performance improvement over LevelDB under the db_bench workload. In addition, DStore-enabled key-value stores outperform LevelDB by about 3.3× in throughput and 77% in latency under the YCSB benchmark.
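DStore's dynamic division of compaction work by relative computing capability can be sketched as a simple proportional split. This is an assumption-laden illustration, not DStore's actual scheduler; the function name and the capability numbers are invented for the example:

```python
def split_compaction_tasks(tasks, host_ops_per_s, device_ops_per_s):
    """Divide a batch of compaction tasks between the host and the NDP
    device in proportion to their (assumed) relative compute capability,
    so both sides finish at roughly the same time."""
    total = host_ops_per_s + device_ops_per_s
    n_host = round(len(tasks) * host_ops_per_s / total)
    return tasks[:n_host], tasks[n_host:]

# Host assumed 3x as capable as the device for these compaction tasks.
host_tasks, device_tasks = split_compaction_tasks(list(range(10)), 3.0, 1.0)
```

In a real system the capability estimates would be measured and updated at runtime, which is what makes the division "dynamic" in the paper's sense.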

  • Conference Article
  • Cited by 152
  • 10.1109/icde.2011.5767918
High performance database logging using storage class memory
  • Apr 1, 2011
  • Ru Fang + 4 more

Storage class memory (SCM), a new generation of memory technology, offers non-volatility, high speed, and byte-addressability, combining the best properties of current hard disk drives (HDD) and main memory. With these extraordinary features, current systems and software stacks need to be redesigned to get significantly improved performance by eliminating disk input/output (I/O) barriers, and simpler system designs by avoiding complicated data format transformations. In current DBMSs, logging and recovery are the most important components for enforcing the atomicity and durability of a database. Traditionally, database systems rely on disks for logging transaction actions, and log records are forced to disk when a transaction commits. Because of slow disk I/O, logging becomes one of the major bottlenecks for a DBMS. Exploiting SCM as a persistent memory for transaction logging can significantly reduce logging overhead. In this paper, we present the detailed design of an SCM-based approach to DBMS logging, which achieves high performance through a simplified system design and better concurrency support. We also discuss solutions to several major issues arising during system recovery, including hole detection, partial write detection, and any-point failure recovery. This new logging approach replaces the traditional disk-based logging approach in DBMSs. To analyze the performance characteristics of our SCM-based logging approach, we implement a prototype on IBM SolidDB. In common circumstances, our experimental results show that the new SCM-based logging approach provides as much as a 7× throughput improvement over disk-based logging in the Telecommunication Application Transaction Processing (TATP) benchmark.

  • Book Chapter
  • 10.1016/b978-0-32-390796-5.00011-5
Chapter 2 - Storage technologies and their data
  • Jan 1, 2022
  • Storage Systems
  • Alexander Thomasian


  • Conference Article
  • Cited by 2
  • 10.1109/fccm53951.2022.9786121
Augmenting HLS with Zero-Overhead Application-Specific Address Mapping for Optane DCPMM
  • May 15, 2022
  • Nicholas Beckwith + 2 more

FPGAs have been introduced to datacenters as a mainstream computing device to accelerate a wide range of data-intensive applications when paired with heterogeneous memory. Leveraging High-Level Synthesis (HLS), application engineers can not only accelerate their applications but also shorten the development time of designing, debugging, and validating accelerators. However, existing HLS flows do not have effective support for emerging memory devices such as Intel's Optane DC Persistent Memory Modules (Optane DCPMM), a storage-class memory in a DIMM form factor. In fact, we observe that some HLS kernels can at best utilize only one-tenth of the total memory bandwidth of Optane DCPMM. To remedy the poor performance of HLS with Optane DCPMM, we augment the existing HLS external memory interface with zero-overhead, application-specific address mapping capabilities. The proposed scheme utilizes both fine-grained information from variable access patterns and coarse-grained variable-interleaving information to select an optimal hybrid address mapping for high memory bandwidth utilization, compared to the default fixed address mapping in existing HLS. Furthermore, our scheme is compatible with existing tool flows such as the Intel FPGA SDK for OpenCL and the Vitis Application Flow, maintaining a low adoption barrier. Using our proposed address mapping scheme and interface, we achieve a 10× speedup on a diverse set of benchmarks, including merge join, matrix multiplication, and convolution, without any additional hardware cost.

  • Research Article
  • Cited by 27
  • 10.1109/tpds.2021.3118599
TridentKV: A Read-Optimized LSM-Tree Based KV Store via Adaptive Indexing and Space-Efficient Partitioning
  • Aug 1, 2022
  • IEEE Transactions on Parallel and Distributed Systems
  • Kai Lu + 5 more

LSM-tree based key-value (KV) stores suffer severe read performance loss due to the leveled structure of the LSM-tree. Especially when modern storage devices with high bandwidth and low latency are used, the read performance of a KV store is seriously limited by inefficient file indexing. In addition, due to the deletion pattern of inserting tombstones, LSM-tree-based KV stores face read performance fluctuations caused by large-scale data deletion (also referred to as the Read-After-Delete problem). In this article, TridentKV is proposed to improve the read performance of KV stores. An adaptive learned index structure is first designed to speed up file indexing, and a space-efficient partition strategy is proposed to solve the Read-After-Delete problem. In addition, an asynchronous read design is adopted, and SPDK is supported for high concurrency and low latency. TridentKV is implemented on RocksDB, and the evaluation results indicate that, compared with RocksDB, the read performance of TridentKV is improved by 7× to 12× without loss of write performance, and TridentKV provides stable read performance even when a large number of deletions or migrations occur. Using TridentKV instead of RocksDB to store metadata in Ceph improves the read performance of Ceph by 20% to 60%.
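A learned index replaces a per-file search structure with a model that predicts a key's position and then corrects the prediction within a known error bound. The toy sketch below uses a single linear least-squares model over sorted keys; TridentKV's actual index is adaptive and considerably more elaborate, so treat this only as the core idea:

```python
import bisect

class LearnedIndexSketch:
    """Toy learned index: fit position ~ slope*key + intercept over sorted
    keys, then correct the prediction with a bounded local binary search."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in sorted_keys)
        # Least-squares fit of array position against key value.
        self.slope = sum((k - mean_k) * (p - mean_p)
                         for p, k in enumerate(sorted_keys)) / var
        self.intercept = mean_p - self.slope * mean_k
        # Maximum prediction error bounds the corrective search window.
        self.err = max(abs(p - self._predict(k))
                       for p, k in enumerate(sorted_keys))

    def _predict(self, key):
        return self.slope * key + self.intercept

    def lookup(self, key):
        guess = self._predict(key)
        lo = max(0, int(guess - self.err))
        hi = min(len(self.keys), int(guess + self.err) + 2)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LearnedIndexSketch(list(range(0, 200, 2)))
```

When the key distribution is close to linear, as here, the model's error bound is tiny and each lookup touches only a handful of positions instead of searching the whole file, which is the source of the indexing speedup such designs target.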

  • Conference Article
  • Cited by 39
  • 10.1109/estimedia.2011.6088521
Exploiting set-level write non-uniformity for energy-efficient NVM-based hybrid cache
  • Oct 1, 2011
  • Jianhua Li + 4 more

Hybrid cache architectures have been proposed to mitigate increasing on-chip power dissipation through the exploitation of emerging non-volatile memories (NVMs). To overcome the high energy and long latency of NVM write operations, a small SRAM is typically incorporated into the hybrid cache to accommodate write-intensive cache blocks. How this SRAM is managed and how write operations are manipulated are crucial to the performance of the hybrid cache. In this paper, we first observe that the intensity of write operations on different cache sets is usually non-uniform for real applications, such as multimedia, multi-programmed, and multithreaded applications. Previously proposed hybrid cache schemes cannot efficiently utilize the small SRAM to accommodate such widely existing non-uniform writes across cache sets. Based on this observation, we propose a novel hybrid cache design, the Dual Associative Hybrid Cache (DAHYC), together with a corresponding cache management policy. By organizing the SRAM blocks in the hybrid cache as a semi-independent set-associative cache, several hybrid cache sets can share and cooperatively utilize their SRAM blocks, instead of each cache set exclusively using its own SRAM blocks as in previous hybrid cache schemes, thereby boosting power efficiency. By prudently managing the locality information of SRAM blocks in both the NVM sets and the SRAM sets, the proposed cache management policy also delivers high performance. Experimental results show that, compared with previous work, DAHYC reduces the dynamic power of the hybrid cache by 24.8% on average (and up to 54%) for SPEC2000 INT benchmarks, while improving the performance of the hybrid cache by 1.16% on average.

  • Conference Article
  • Cited by 5
  • 10.1109/ipdps47924.2020.00104
FlashKey:A High-Performance Flash Friendly Key-Value Store
  • May 1, 2020
  • Madhurima Ray + 3 more

Key-value stores (KVS) provide efficient storage for the increasing amounts of semi-structured and unstructured data generated by many applications. Most existing KVS have been designed for hard-disk-based storage, where avoiding random accesses is crucial for good performance. Unfortunately, these storage structures cause high read, write, and space amplification when used on modern SSDs. In this paper, we introduce FlashKey, a KV store designed specifically for SSDs, and demonstrate that even as an initial implementation, it substantially outperforms the two most popular commercial KVS, Google's LevelDB and Facebook's RocksDB. In particular, we show that FlashKey achieves up to 85% improvement in average access latency, 2x improvement in tail latencies, and 12x improvement in write amplification, at comparable or better space amplification. Furthermore, FlashKey can easily trade off space and write amplification, providing a new tuning knob that is difficult to implement in LevelDB and RocksDB.

  • Research Article
  • Cited by 127
  • 10.14778/2809974.2809984
Mega-KV
  • Jul 1, 2015
  • Proceedings of the VLDB Endowment
  • Kai Zhang + 5 more

In-memory key-value stores play a critical role in data processing by providing high-throughput, low-latency data access. They have several unique properties: (1) data-intensive operations demanding high memory bandwidth for fast data access, (2) high data parallelism and simple computing operations demanding many slim parallel computing units, and (3) a large working set. As data volume continues to increase, our experiments show that conventional general-purpose multicore systems are increasingly mismatched to these properties: they do not provide massive data parallelism and high memory bandwidth; their powerful but limited number of computing cores does not satisfy the demands of this data processing task; and the cache hierarchy may not benefit the large working set. In this paper, we make a strong case for GPUs as special-purpose devices that greatly accelerate the operations of in-memory key-value stores. Specifically, we present the design and implementation of Mega-KV, a GPU-based in-memory key-value store system that achieves high performance and high throughput. By effectively utilizing the high memory bandwidth and latency-hiding capability of GPUs, Mega-KV provides fast data access and significantly boosts overall performance. Running on a commodity PC with two CPUs and two GPUs, Mega-KV can process up to 160+ million key-value operations per second, 1.4-2.8 times as fast as the state-of-the-art key-value store system on a conventional CPU-based platform.
