Modern Workloads Research Articles

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip. This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. 
We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their modern CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

The workloads running in the modern data centers of large scale Internet service providers (such as Alibaba, Amazon, Baidu, Facebook, Google, and Microsoft) support billions of users and span globally distributed infrastructure. Yet, the devices used in modern data centers fail due to a variety of causes, from faulty components to bugs to misconfiguration. Faulty devices make operating large scale data centers challenging because the workloads running in modern data centers consist of interdependent programs distributed across many servers, so failures that are isolated to a single device can still have a widespread effect on a workload.

In this dissertation, we measure and model the device failures in a large scale Internet service company, Facebook. We focus on three device types that form the foundation of Internet service data center infrastructure: DRAM for main memory, SSDs for persistent storage, and switches and backbone links for network connectivity. For each of these device types, we analyze long term device failure data broken down by important device attributes and operating conditions, such as age, vendor, and workload. We also build and release statistical models of the failure trends for the devices we analyze.

For DRAM devices, we analyze the memory errors in the entire fleet of servers at Facebook over the course of fourteen months, representing billions of device days of operation. The systems we examine cover a wide range of devices commonly used in modern servers, with DIMMs that use the modern DDR3 communication protocol, manufactured by 4 vendors in capacities ranging from 2GB to 24GB. We observe several new reliability trends for memory systems that have not been discussed before in literature, develop a model for memory reliability, and show how system design choices such as using lower density DIMMs and fewer cores per chip can reduce failure rates of a baseline server by up to 57.7%. We perform the first implementation and real-system analysis of page offlining at scale, on a cluster of thousands of servers, identify several real-world impediments to the technique, and show that it can reduce memory error rate by 67%. We also examine the efficacy of a new technique to reduce DRAM faults, physical page randomization, and examine its potential for improving reliability and its overheads.

For SSD devices, we perform a large scale study of flash-based SSD reliability at Facebook. We analyze data collected across a majority of flash-based solid state drives over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power. Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, we make several major observations and find that SSD failure rates do not increase monotonically with flash chip wear, but instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected.

For network devices, we perform a large scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook. Our study covers reliability characteristics of both intra and inter data center networks. For intra data center networks, we study seven years of operation data comprising thousands of network incidents across two different data center network designs, a cluster network design and a state-of-the-art fabric network design. For inter data center networks, we study eighteen months of recent repair tickets from the field to understand the reliability of Wide Area Network (WAN) backbones. In contrast to prior work, we study the effects of network reliability on software systems, and how these reliability characteristics evolve over time. We discuss the implications of network reliability on the design, implementation, and operation of large scale data center systems and how the network affects highly-available web services.

Our key conclusion in this dissertation is that we can gain a deep understanding of why devices fail, and how to predict their failure, using measurement and modeling. We hope that the analysis, techniques, and models we present in this dissertation will enable the community to better measure, understand, and prepare for the hardware reliability challenges we face in the future.

Articles published on Modern Workloads

Breathing New Life into an Old Tree: Resolving Logging Dilemma of B+-tree on Modern Computational Storage Drives

NDP-RANK: Prediction and ranking of NDP systems performance using machine learning

Velox

Diagnosing the coexistence of Poissonity and self-similarity in memory workloads

MoRS: An Approximate Fault Modeling Framework for Reduced-Voltage SRAMs

Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators

Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

Rearchitecting in-memory object stores for low latency

Resiliency Engineering in Cloud-Native Environments: Fail-Safe Mechanisms for Modern Workloads

A Survey on Domain-Specific Memory Architectures

Extended performance accounting using Valgrind tool

Using Human-Agent Teams to Purposefully Design Multi-Agent Systems

Flexible device compositions and dynamic resource sharing in PCIe interconnected clusters using Device Lending

Speeding up Collective Communications Through Inter-GPU Re-Routing

Large Scale Studies of Memory, Storage, and Network Failures in a Modern Data Center

An Event-Triggered Programmable Prefetcher for Irregular Workloads

Making Huge Pages Actually Useful

Performance innovations in the IBM z14 platform

Domino Cache

Dynamo
