Summary

With the increasing adoption of graph neural networks (GNNs) in the community, various GPU-based graph programming systems have been developed to improve the productivity of GNN development. However, sampling-based GNN training remains inefficient, and we observe that the main bottleneck is data transfer, where vertex features are moved from host memory to the GPU over a limited-bandwidth link. In this article, we propose BRGraph, a sampling-based GNN training system that supports efficient data transfer. BRGraph exploits the duplicate vertices shared between mini-batches through a batch reusing (BR) strategy to avoid redundant data transmission. Furthermore, to reduce the overhead of detecting duplicate vertices, we design an efficient GPU-based parallel batch reusing algorithm. BRGraph also exploits the reuse potential of non-duplicate vertex features through a two-level batch reusing (two-level BR) strategy. Comprehensive evaluations on three representative GNN models show that BRGraph reduces data transfer time by up to 60% and delivers up to 1.79× GNN training speedup over state-of-the-art baselines. In addition, it saves up to 40% of GPU memory while matching the training time of the static cache strategy. With two-level BR applied, BRGraph further reduces data transfer time by 20% compared with BR alone.
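The core idea behind BR, reusing GPU-resident features of vertices that also appeared in the previous mini-batch and transferring only the features of newly appearing vertices, can be illustrated with a minimal PyTorch sketch. The function name, tensor layout, and the searchsorted-based duplicate detection below are illustrative assumptions, not BRGraph's actual implementation.

```python
import torch

def transfer_with_reuse(batch_ids, prev_ids, prev_feats, host_feats, device="cuda"):
    """Assemble the feature tensor for the current mini-batch (sketch).

    batch_ids:  1-D LongTensor of vertex IDs in the current batch (on GPU)
    prev_ids:   1-D LongTensor of vertex IDs of the previous batch (on GPU)
    prev_feats: features of the previous batch, already resident on the GPU
    host_feats: full feature matrix in (ideally pinned) host memory
    """
    # Locate each current vertex in the sorted previous batch; matching
    # positions mark duplicate vertices whose features are already on GPU.
    sorted_prev, order = torch.sort(prev_ids)
    pos = torch.searchsorted(sorted_prev, batch_ids)
    pos = pos.clamp(max=sorted_prev.numel() - 1)
    dup = sorted_prev[pos] == batch_ids  # duplicate-vertex mask

    out = torch.empty(batch_ids.numel(), host_feats.size(1),
                      dtype=prev_feats.dtype, device=device)
    # Reuse GPU-resident features for duplicate vertices (no PCIe traffic).
    out[dup] = prev_feats[order[pos[dup]]]
    # Transfer only the non-duplicate vertices' features from host memory.
    new_ids = batch_ids[~dup]
    out[~dup] = host_feats[new_ids.cpu()].to(device, non_blocking=True)
    return out
```

In this sketch, the fraction of `dup` entries determines the transfer savings: the larger the overlap between consecutive mini-batches, the less data crosses the host-to-GPU link.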