Reliability-aware Garbage Collection for Hybrid HBM-DRAM Memories

Wenjie Liu, Jennifer B. Sartor, Shoaib Akram, Lieven Eeckhout

doi:10.1145/3431803

Abstract

Emerging workloads in cloud and data center infrastructures demand high main memory bandwidth and capacity. Unfortunately, DRAM alone is unable to satisfy contemporary main memory demands. High-bandwidth memory (HBM) uses 3D die-stacking to deliver 4–8× higher bandwidth. HBM has two drawbacks: (1) capacity is low, and (2) soft error rate is high. Hybrid memory combines DRAM and HBM to promise low fault rates, high bandwidth, and high capacity. Prior OS approaches manage HBM by mapping pages to HBM versus DRAM based on hotness (access frequency) and risk (susceptibility to soft errors). Unfortunately, these approaches operate at a coarse-grained page granularity, and frequent page migrations hurt performance. This article proposes a new class of reliability-aware garbage collectors for hybrid HBM-DRAM systems that place hot and low-risk objects in HBM and the rest in DRAM. Our analysis of nine real-world Java workloads shows that: (1) newly allocated objects in the nursery are frequently written, making them both hot and low-risk, (2) a small fraction of the mature objects are hot and low-risk, and (3) allocation site is a good predictor for hotness and risk. We propose RiskRelief, a novel reliability-aware garbage collector that uses allocation site prediction to place hot and low-risk objects in HBM. Allocation sites are profiled offline, and RiskRelief uses heuristics to classify each allocation site as either DRAM or HBM. The proposed heuristics expose Pareto-optimal trade-offs between soft error rate (SER) and execution time. RiskRelief improves SER by 9× compared to an HBM-Only system while at the same time improving performance by 29% compared to a DRAM-Only system. Compared to a state-of-the-art OS approach for reliability-aware data placement, RiskRelief eliminates all page migration overheads, which substantially improves performance while delivering similar SER.
Reliability-aware garbage collection opens up a new opportunity to manage emerging HBM-DRAM memories at fine granularity while requiring no extra hardware support and leaving the programming model unchanged.

Highlights

Emerging cloud workloads, such as machine learning inference and stream analytics, have encouraged new throughput-oriented compute platforms
Following prior work by Villavieja et al. [67], we model the overhead of a TLB shootdown in a system with $N$ cores as $T_{\text{shootdown}} = N \times T_{\text{slave}} + T_{\text{initiator}}$, where $T_{\text{slave}}$ and $T_{\text{initiator}}$ are the time overheads incurred by each slave core and by the initiator core, respectively
Emerging High Bandwidth Memory (HBM) uses 3D stacking to offer more bandwidth than DRAM
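
The TLB shootdown model quoted in the highlights can be evaluated directly. The following is a minimal sketch of that formula only; the function name, the time unit, and the example values are assumptions chosen for illustration and are not taken from the article.

```python
def shootdown_time(n_cores: int, t_slave: float, t_initiator: float) -> float:
    """Evaluate T_shootdown = N * T_slave + T_initiator.

    n_cores     -- N, the number of cores in the system
    t_slave     -- overhead incurred by each slave core (any consistent time unit)
    t_initiator -- overhead incurred by the initiator core (same unit)
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return n_cores * t_slave + t_initiator


# Hypothetical values, for illustration only: 4 cores, 2.0 time units per
# slave core, and 5.0 time units for the initiator.
example_cost = shootdown_time(4, 2.0, 5.0)
```

Because the slave term scales linearly with the core count, the model suggests that page migrations which trigger shootdowns become more expensive as the number of cores grows.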

Summary

Introduction

Emerging cloud workloads, such as machine learning inference and stream analytics, have encouraged new throughput-oriented compute platforms. These platforms consist of many-core processors, graphics processing units, and a range of accelerators. Hybrid HBM-DRAM memory combines the best of both worlds to provide high capacity and high bandwidth. Before describing how RiskRelief predicts hotness and risk and leverages these predictions to manage hybrid HBM-DRAM systems, we first provide additional background on soft error reliability and managed runtimes. Hotness refers to how frequently an object is accessed, whereas risk refers to how susceptible an object is to soft errors. We define both concepts and focus more on risk, because it is the less well-known metric. The high percentage of writes motivates defining our hotness criterion as the sum of reads and writes.
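
To make the placement idea concrete, here is a minimal sketch of classifying allocation sites from an offline profile. Only two things come from the text: hotness is the sum of reads and writes, and hot, low-risk objects go to HBM while the rest go to DRAM. Everything else is an assumption for illustration: the `SiteProfile` record, the unitless `risk` score (the text here does not specify how risk is computed), the two thresholds, and the site names. This is not RiskRelief's actual heuristic.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    """Offline profile of one allocation site (hypothetical record layout)."""
    site: str     # allocation site identifier, e.g. "Foo.java:42"
    reads: int    # profiled read count for objects from this site
    writes: int   # profiled write count for objects from this site
    risk: float   # assumed unitless risk score; higher means more susceptible


def hotness(profile: SiteProfile) -> int:
    # Hotness criterion from the text: the sum of reads and writes.
    return profile.reads + profile.writes


def classify_sites(profiles, hot_threshold, risk_threshold):
    """Map each site to "HBM" if it is both hot and low-risk, else "DRAM".

    Both thresholds are illustrative tuning knobs, not values from the article.
    """
    placement = {}
    for p in profiles:
        is_hot = hotness(p) >= hot_threshold
        is_low_risk = p.risk <= risk_threshold
        placement[p.site] = "HBM" if (is_hot and is_low_risk) else "DRAM"
    return placement


# Hypothetical profile data, for illustration only.
profiles = [
    SiteProfile("A.java:10", reads=900, writes=600, risk=0.1),  # hot, low-risk
    SiteProfile("B.java:20", reads=5, writes=1, risk=0.1),      # cold
    SiteProfile("C.java:30", reads=2000, writes=10, risk=0.9),  # hot, high-risk
]
placement = classify_sites(profiles, hot_threshold=1000, risk_threshold=0.5)
```

Varying the two thresholds shifts sites between HBM and DRAM, which is one simple way to picture a trade-off between soft error rate and execution time.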

Objectives

Methods

Results

Conclusion