LACX: Locality-Aware Shared Data Migration in NUMA + CXL Tiered Memory
In modern high-performance computing (HPC) and large-scale data processing environments, the efficient utilization and scalability of memory resources are critical determinants of overall system performance. Architectures such as non-uniform memory access (NUMA) and tiered memory systems frequently suffer performance degradation due to remote accesses stemming from shared data among multiple tasks. This paper proposes LACX, a shared data migration technique leveraging Compute Express Link (CXL), to address these challenges. LACX preserves the migration cycle of automatic NUMA balancing (AutoNUMA) while identifying shared data characteristics and migrating such data to CXL memory instead of DRAM, thereby maximizing DRAM locality. The proposed method utilizes existing kernel structures and data to efficiently identify and manage shared data without incurring additional overhead, and it effectively avoids conflicts with AutoNUMA policies. Evaluation results demonstrate that, although remote accesses to shared data can degrade performance in low-tier memory scenarios, LACX significantly improves overall memory bandwidth utilization and system performance in high-tier memory and memory-intensive workload environments by distributing DRAM bandwidth. This work presents a practical, lightweight approach to shared data management in tiered memory environments and highlights new directions for next-generation memory management policies.
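The core placement decision described above can be sketched as a toy model. This is an illustrative reconstruction, not the authors' kernel code: the node id `CXL_NODE` and the function name are assumptions; the idea shown is only that pages touched by tasks on more than one node are routed to the CXL tier instead of DRAM.

```python
# Hypothetical model of LACX-style placement during an AutoNUMA-like scan:
# shared pages (accessed from multiple nodes) go to the CXL node, freeing
# DRAM bandwidth for node-private data.

CXL_NODE = 2  # assumed node id of the CXL memory expander


def choose_target_node(page_accessor_nodes: set[int]) -> int:
    """Return the node a scanned page should migrate to."""
    if len(page_accessor_nodes) > 1:
        return CXL_NODE               # shared data -> CXL tier
    (node,) = page_accessor_nodes     # private data -> accessor's DRAM node
    return node


# A page touched by tasks on nodes 0 and 1 is classified as shared.
print(choose_target_node({0, 1}))  # 2
print(choose_target_node({0}))     # 0
```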
- Conference Article
3
- 10.1109/trustcom.2016.0187
- Aug 1, 2016
Modern multi-core architectures are characterized by Non-Uniform Memory Access (NUMA). Efficiently exploiting such architectures is extremely complicated for programmers. Multi-threaded programs may encounter high memory access latency if the mapping of data and computation is not considered carefully on such systems. Programmers need tools to detect performance problems when high memory access latency occurs. To address this need, we present a profiling tool called LaProf, which uses memory access latency information to detect performance problems on NUMA systems. The tool detects three performance problems of multi-threaded programs: 1) data sharing, where shared data causes remote memory accesses if the threads accessing it are not allocated on the same NUMA node; 2) shared resource contention, where contention on shared resources such as last-level caches, interconnect links, and memory controllers severely degrades performance through high memory access latency; and 3) remote access imbalance, where the thread with the largest number of remote memory accesses becomes the critical thread that drags down the overall performance of the multi-threaded program. After detection by LaProf, applying simple and general NUMA optimization techniques yields performance improvements of 88%, 32%, and 99% for the three problems, respectively.
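The third check, remote-access imbalance, reduces to finding the thread with the most remote accesses. A minimal sketch (illustrative only; the data layout and function name are not LaProf's):

```python
# Identify the critical thread: the one with the highest remote-access
# count, which drags down the whole multi-threaded program.

def critical_thread(remote_accesses: dict[str, int]) -> str:
    """Return the thread id with the largest remote-access count."""
    return max(remote_accesses, key=remote_accesses.get)


counts = {"t0": 120, "t1": 950, "t2": 310}
print(critical_thread(counts))  # t1
```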
- Book Chapter
- 10.1007/978-981-13-2853-4_43
- Jan 1, 2018
In recent years, the size and complexity of the datasets generated by large-scale numerical simulations on modern HPC (High Performance Computing) systems have been continuously increasing. These datasets can possess different formats, types, and attributes. In this work, we focus on large-scale distributed unstructured volume datasets, which are still widely used in numerical simulations across a variety of scientific and engineering fields. Although volume rendering is one of the most popular techniques for analyzing and exploring a given volume dataset, for unstructured volume data the time-consuming visibility sorting becomes problematic as the data size increases. Targeting effective volume rendering of large-scale distributed unstructured volume datasets generated in HPC environments, we opted for the well-known PBVR (Particle-based Volume Rendering) method. Although PBVR requires no visibility sorting during rendering, the CPU-based approach suffers from a well-known tradeoff between image quality and memory consumption, because the entire set of intermediate rendering primitives (particles) must be stored before rendering begins. To reduce this memory pressure, we propose a fully parallel PBVR approach that eliminates the need to store these intermediate rendering primitives, as existing approaches require. In the proposed method, each process directly converts its set of rendering primitives to a partial image, and the partial images are then gathered and merged by the 234Compositor parallel image composition library. We evaluated the memory cost and processing time using a real CFD simulation result and verified the effectiveness of our proposed method compared to the existing parallel PBVR method.
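The render-then-merge pipeline can be sketched in miniature. This is an illustrative stand-in, not the paper's code: it models a 1-D image, treats particles as opaque, and merges partial images with a per-pixel depth test, which is the kind of composition a library such as 234Compositor performs.

```python
# Each process converts its local particles to a partial image (color plus
# per-pixel depth); partial images are then merged: nearest depth wins.

INF = float("inf")


def render_partial(particles, width):
    """particles: list of (pixel, depth, color) owned by one process."""
    depth = [INF] * width
    color = [0] * width
    for px, d, c in particles:
        if d < depth[px]:           # keep the nearest particle per pixel
            depth[px], color[px] = d, c
    return depth, color


def composite(images):
    """Merge partial (depth, color) images; nearest depth wins per pixel."""
    depth, color = [list(x) for x in images[0]]
    for d2, c2 in images[1:]:
        for i, (d, dn) in enumerate(zip(depth, d2)):
            if dn < d:
                depth[i], color[i] = dn, c2[i]
    return color


img_a = render_partial([(0, 2.0, 10), (1, 1.0, 20)], width=2)
img_b = render_partial([(0, 1.5, 30)], width=2)
print(composite([img_a, img_b]))  # [30, 20]
```

Because each process emits only its own partial image, no process ever holds the full particle set, which is the memory saving the abstract describes.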
- Conference Article
6
- 10.1109/compsac.2018.10264
- Jul 1, 2018
Multicore systems with Non-Uniform Memory Access (NUMA) architecture are becoming popular in computer systems. Processor cores take longer to access memory on remote NUMA nodes than on local nodes. Because of the operating system kernel's memory allocation and load balancing activities, a process may be migrated across nodes and its allocated memory may be scattered over several nodes. Remote memory access and resource contention cause significant performance degradation in multicore NUMA systems. In this study, to reduce contention for inter-node interconnect links and decrease remote memory access in NUMA systems, we enhance the kernel's inter-node load balancing by migrating suitable processes or light-weight processes (i.e., threads) between NUMA nodes. We further propose various selection policies that select processes for migration according to their memory usage, so that the selected process uses the least amount of page frames in the system and/or shares the least amount of memory with other processes. Such processes are expected to incur less remote memory access and to be least affected by migration. The Linux kernel is modified to incorporate the proposed policies into the enhanced inter-node load balancing procedure. Experimental results demonstrate that system performance is successfully improved by effectively reducing remote memory access and resource contention.
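The selection policy described above can be sketched directly (field names here are assumptions, not the kernel patch's identifiers): among migratable tasks, prefer the one using the fewest page frames and sharing the least memory.

```python
# Pick the migration victim that uses the fewest page frames and, as a
# tie-breaker, shares the least memory with other processes.

def pick_migration_victim(tasks):
    """tasks: list of dicts with 'pid', 'frames', 'shared_frames'."""
    return min(tasks, key=lambda t: (t["frames"], t["shared_frames"]))["pid"]


tasks = [
    {"pid": 101, "frames": 4000, "shared_frames": 120},
    {"pid": 102, "frames": 800,  "shared_frames": 300},
    {"pid": 103, "frames": 800,  "shared_frames": 40},
]
print(pick_migration_victim(tasks))  # 103
```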
- Conference Article
7
- 10.1109/cluster.2018.00023
- Sep 1, 2018
Modern High Performance Computing (HPC) applications, such as Earth science simulations, produce large amounts of data due to surging computing power, while big data applications have become more compute-intensive due to increasingly sophisticated analysis algorithms. The need of advanced applications for both HPC and big data technologies creates a demand for integrated system support. In this study, we introduce Scientific Data Processing (SciDP) to support both HPC and big data applications via integrated scientific data processing. SciDP can directly process scientific data stored on a Parallel File System (PFS), typically deployed in an HPC environment, within a big data programming environment running atop the Hadoop Distributed File System (HDFS). SciDP seamlessly integrates PFS, HDFS, and the widely used R data analysis system to support highly efficient processing of scientific data. It exploits the merits of both PFS and HDFS for fast data transfer, overlaps computing with data access, and integrates R into the data transfer process. Experimental results show that SciDP accelerates analysis and visualization of a production NASA Center for Climate Simulation (NCCS) climate and weather application by 6x to 8x compared to existing solutions.
- Conference Article
37
- 10.1109/ipdps.2015.83
- May 1, 2015
The viability and benefits of running MapReduce over modern High Performance Computing (HPC) clusters, with high performance interconnects and parallel file systems, have attracted much attention in recent times due to its uniqueness of solving data analytics problems with a combination of Big Data and HPC technologies. Most HPC clusters follow the traditional Beowulf architecture with a separate parallel storage system (e.g. Lustre) and either no, or very limited, local storage. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage system in HPC clusters poses many new opportunities and challenges. In this paper, we propose a novel high-performance design for running YARN MapReduce on such HPC clusters by utilizing Lustre as the storage provider for intermediate data. We identify two different shuffle strategies, RDMA and Lustre Read, for this architecture and provide modules to dynamically detect the best strategy for a given scenario. Our results indicate that due to the performance characteristics of the underlying Lustre setup, one shuffle strategy may outperform another in different HPC environments, and our dynamic detection mechanism can deliver best performance based on the performance characteristics obtained during runtime of job execution. Through this design, we can achieve 44% performance benefit for shuffle-intensive workloads in leadership-class HPC systems. To the best of our knowledge, this is the first attempt to exploit performance characteristics of alternate shuffle strategies for YARN MapReduce with Lustre and RDMA.
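The dynamic detection mechanism amounts to timing the candidate shuffle strategies and keeping the fastest. A minimal sketch, with toy stand-ins for the RDMA and Lustre-Read paths (the real module measures runtime characteristics of actual job execution; everything here is illustrative):

```python
# Time a small sample with each candidate strategy up front, then keep the
# fastest one for the remaining shuffles of the job.

import time


def pick_shuffle_strategy(strategies, sample_workload):
    """strategies: dict name -> callable(sample) simulating one shuffle."""
    timings = {}
    for name, run in strategies.items():
        start = time.perf_counter()
        run(sample_workload)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)


# Toy stand-ins: one cheap path, one deliberately heavier here.
strategies = {
    "rdma":        lambda s: sum(s),
    "lustre_read": lambda s: sorted(s * 50),
}
print(pick_shuffle_strategy(strategies, list(range(10_000))))
```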
- Conference Article
1
- 10.1145/3471873.3472974
- Aug 22, 2021
As the number of cores increases, Non-Uniform Memory Access (NUMA) is becoming increasingly prevalent in general purpose machines. Effectively exploiting NUMA can significantly reduce memory access latency, and thus runtime, by 10-20%, and profiling provides information on how to optimise. Language-level NUMA profilers are rare, and mostly profile conventional languages executing on virtual machines. Here we profile, and develop new NUMA profilers for, a functional language executing on a runtime system. We start by using existing OS- and language-level tools to systematically profile 8 benchmarks from the GHC Haskell nofib suite on a typical NUMA server (8 regions, 64 cores). We propose a new metric, the NUMA access rate, that allows us to compare the load placed on the memory system by different programs, and use it to contrast the benchmarks. We demonstrate significant differences in NUMA usage between computational and data-intensive benchmarks, e.g. local memory access rates of 23% and 30%, respectively. We show that small changes to coordination behaviour can significantly alter NUMA usage, and for the first time quantify the effectiveness of the GHC 8.2 NUMA adaptation. We identify information not available from existing profilers and extend both the numaprof profiler and the GHC runtime system to obtain three new NUMA profiles: OS thread allocation locality, GC count (per region and generation), and GC thread locality. The new profiles not only provide a deeper understanding of program memory usage, they also suggest ways that GHC can be adapted to better exploit NUMA architectures.
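The local-access-rate figures quoted above follow from a simple ratio; a sketch of the computation (the exact event counters the profilers use are not shown here):

```python
# Local memory access rate: the fraction of memory accesses served by the
# local NUMA region, used to compare the load programs place on memory.

def local_access_rate(local: int, remote: int) -> float:
    """Fraction of accesses served by the local NUMA region."""
    total = local + remote
    return local / total if total else 0.0


# E.g. 23 local accesses out of 100 total gives a 23% local rate.
print(round(local_access_rate(23, 77), 2))  # 0.23
```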
- Research Article
- 10.1007/s11771-012-1270-4
- Aug 1, 2012
- Journal of Central South University
Most transactional memory (TM) research has focused on multi-core processors, and some has investigated clusters, leaving the area of non-uniform memory access (NUMA) systems unexplored. Existing TM implementations suffer significant performance degradation on NUMA systems because they ignore the slower remote memory access. To solve this problem, a latency-based conflict detection method and a forecasting-based conflict prevention method are proposed, and a NUMA-aware TM system built on these techniques is presented. By reducing remote memory access and the transaction abort rate, the experimental results show that the NUMA-aware strategies deliver good practical TM performance on NUMA systems.
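One way forecasting-based conflict prevention can work is to serialize transaction pairs that have conflicted repeatedly, avoiding an abort followed by a slow remote-memory re-execution. The predictor below is a made-up stand-in for the paper's (unspecified here) forecasting method, shown only to make the idea concrete:

```python
# Track recent conflicts per transaction pair; once a pair has conflicted
# 'threshold' times, pre-emptively serialize it instead of letting it abort.

from collections import Counter


class ConflictForecaster:
    def __init__(self, threshold: int = 2):
        self.history = Counter()   # (txn_a, txn_b) -> recent conflict count
        self.threshold = threshold

    def record_conflict(self, a: str, b: str) -> None:
        self.history[tuple(sorted((a, b)))] += 1

    def should_serialize(self, a: str, b: str) -> bool:
        return self.history[tuple(sorted((a, b)))] >= self.threshold


f = ConflictForecaster()
f.record_conflict("T1", "T2")
print(f.should_serialize("T1", "T2"))  # False: only one conflict so far
f.record_conflict("T2", "T1")          # pair key is order-insensitive
print(f.should_serialize("T1", "T2"))  # True: threshold reached
```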
- Conference Article
8
- 10.1109/micro50266.2020.00082
- Oct 1, 2020
Data-intensive applications put immense strain on the memory systems of Graphics Processing Units (GPUs). To cater to this need, GPU memory systems distribute requests across independent units to provide high bandwidth by servicing requests (mostly) in parallel. We find that this strategy breaks down for shared data structures because the shared Last-Level Cache (LLC) organization used by contemporary GPUs stores shared data in a single LLC slice. Shared data requests are hence serialized - resulting in data-intensive applications not being provided with the bandwidth they require. A private LLC organization can provide high bandwidth, but it is often undesirable since it significantly reduces the effective LLC capacity. In this work, we propose the Selective Replication (SelRep) LLC which selectively replicates shared read-only data across LLC slices to improve bandwidth supply while ensuring that the LLC retains sufficient capacity to keep shared data cached. The compile-time component of SelRep LLC uses dataflow analysis to identify read-only shared data structures and uses a special-purpose load instruction for these accesses. The runtime component of SelRep LLC then monitors the caching behavior of these loads. Leveraging an analytical model, SelRep LLC chooses a replication degree that carefully balances the effective LLC bandwidth benefits of replication against its capacity cost. SelRep LLC consistently provides high performance to replication-sensitive applications across different data set sizes. More specifically, SelRep LLC improves performance by 19.7% and 11.1% on average (and up to 61.6% and 31.0%) compared to the shared LLC baseline and the state-of-the-art Adaptive LLC, respectively.
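The replication-degree choice balances bandwidth against capacity. The toy model below is a made-up stand-in for the paper's analytical model (the budget parameter and the fit rule are assumptions): replicating read-only shared data across more LLC slices multiplies the bandwidth serving it, but each copy costs capacity, so take the largest degree whose copies still fit the budgeted share of the LLC.

```python
# Pick the largest replication degree whose total footprint fits within the
# LLC capacity budgeted for replicas of a read-only shared structure.

def choose_replication_degree(n_slices, data_kb, capacity_budget_kb):
    """Largest degree (copies across slices) fitting the capacity budget."""
    for degree in range(n_slices, 0, -1):
        if degree * data_kb <= capacity_budget_kb:
            return degree
    return 1  # a single shared copy always remains


# 16 slices, a 96 KB structure, 512 KB of LLC budgeted for replicas.
print(choose_replication_degree(16, 96, 512))  # 5
```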
- Research Article
9
- 10.1007/s10586-017-1015-0
- Jul 6, 2017
- Cluster Computing
Over the last several years, many sequence alignment tools have appeared and become popular owing to the fast evolution of next generation sequencing technologies. Obviously, researchers who use such tools are interested in getting maximum performance when executing them on modern infrastructures. Today's NUMA (Non-uniform memory access) architectures present major challenges for such applications to achieve good scalability as more processors/cores are used. The memory system in NUMA machines is highly complex and may be the main cause of an application's performance loss. The existence of several memory banks in NUMA systems implies an increase in latency for accesses by a given processor to a remote bank. This phenomenon is usually attenuated by strategies that increase the locality of memory accesses. However, NUMA systems may also suffer from contention problems when concurrent accesses are concentrated on a small number of banks. Sequence alignment tools use large data structures to hold the reference genomes to which all reads are aligned, and are therefore very sensitive to performance problems related to the memory system. The main goal of this study is to explore the trade-offs between data locality and data dispersion in NUMA systems. We performed experiments with several popular sequence alignment tools on two widely available NUMA systems to assess the performance of different memory allocation policies and data partitioning strategies. We find that no single method is best in all cases. However, we conclude that memory interleaving is the allocation strategy that provides the best performance when a large number of processors and memory banks are used. For data partitioning, the best results are usually obtained with a larger number of partitions, sometimes combined with an interleave policy.
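The interleave policy the study favors at scale simply spreads consecutive pages round-robin across banks, trading locality for freedom from contention on any single bank. A minimal sketch of the mapping:

```python
# Interleaved allocation: page p lands on bank p mod n_banks, so consecutive
# pages never pile up on one bank.

def interleave_bank(page_index: int, n_banks: int) -> int:
    """Bank that holds a given page under an interleaved policy."""
    return page_index % n_banks


# Eight consecutive pages over four banks: accesses are evenly dispersed.
print([interleave_bank(p, 4) for p in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

This is the same dispersal that `numactl --interleave` requests from the kernel; the sketch models only the page-to-bank arithmetic.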
- Research Article
45
- 10.1109/tsg.2016.2647220
- May 1, 2017
- IEEE Transactions on Smart Grid
Dynamic simulation for transient stability assessment is one of the most important, but intensive, computational tasks for power system planning and operation. Several commercial software tools provide functionality for performing multiple dynamic simulations, such as those in contingency analysis, simultaneously on parallel computers. Nevertheless, a single dynamic simulation is still a time-consuming process performed sequentially on a single computing core, as the tools were originally designed. Modern high performance computing (HPC) holds the promise of accelerating a single dynamic simulation by parallelizing its kernel algorithms without compromising computational accuracy. Parallelizing a single dynamic simulation is a much more challenging problem than contingency-type parallel computing: it requires a good match between simulation algorithms and computing hardware. This paper provides guidance for such a match, so as to design and implement parallel dynamic simulation that maximizes the utilization of computing hardware and the performance of the simulation. The guidance is derived through comparative implementation of four parallel dynamic simulation schemes in two state-of-the-art HPC environments: 1) Message Passing Interface (MPI) and 2) Open Multi-Processing (OpenMP). The scalability and speedup of parallelized dynamic simulation are thoroughly studied to determine the impact of simulation algorithms and computing hardware configurations. Several test cases are presented to illustrate the derived guidance.
- Research Article
4
- 10.1002/spe.2731
- Jul 19, 2019
- Software: Practice and Experience
Although nonuniform memory access architecture provides better scalability for multicore systems, cores accessing memory on remote nodes take longer than those accessing local nodes. Remote memory access accompanied by contention for internode interconnection degrades performance. Properly mapping threads to cores and data to the nodes that access it can substantially improve performance and energy efficiency. However, an operating system kernel's load-balancing activity may migrate threads across nodes, which disrupts the thread mapping. In addition, subsequent data mapping incurs the cost of page migration to reduce remote memory access. Migrating unsuitable threads is thus detrimental to system performance. This paper focuses on improving the kernel's internode load balancing on nonuniform memory access systems. We develop a memory-aware kernel mechanism and policies to reduce remote memory access incurred by internode thread migration. The Linux kernel's load balancing mechanism is modified to incorporate selection policies for internode thread migration, and the kernel is modified to track the amount of memory used by each thread on each node. With this information, well-designed policies can choose suitable threads for internode migration, the purpose being to avoid migrating a thread that might incur relatively more remote memory access and page migration. The experimental results show that with our mechanism and the proposed selection policies, system performance is substantially increased compared with the unmodified Linux kernel, which does not consider memory usage and always migrates the first-fit thread in the runqueue that can be migrated to the target central processing unit.
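Given per-thread, per-node memory tracking, the memory-aware selection reduces to preferring the candidate with the most memory already on the target node. A sketch under that assumption (the data layout and function name are illustrative, not the kernel modification's):

```python
# When the balancer must move a thread to target_node, pick the candidate
# with the most pages already resident there, minimizing the remote access
# and page migration the move would otherwise cause.

def pick_thread_for(target_node, candidates):
    """candidates: list of (tid, {node: resident_pages}) tuples."""
    return max(candidates, key=lambda c: c[1].get(target_node, 0))[0]


candidates = [
    ("t1", {0: 900, 1: 100}),
    ("t2", {0: 50,  1: 700}),
]
print(pick_thread_for(1, candidates))  # t2
```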
- Research Article
6
- 10.3934/electreng.2019.3.233
- Jan 1, 2019
- AIMS Electronics and Electrical Engineering
The tremendous growth rates of big data and IP traffic between interconnected Data Centers (DCs) and High Performance Computing (HPC) environments have imposed the need for ultrahigh link capacities and ultrahigh packet switching speeds at network nodes. To meet these demands, particularly for packet routing and forwarding speeds, long-tested and established technologies such as optical switching and labeling appear to provide adequate solutions, not only by conveying ultrahigh bit rate data streams but also by achieving multi-Tb/s cross-connection throughputs in a cost- and energy-efficient way. By adopting optical switching and labeling technology, big data streams propagate directly in the optical layer, reducing bottlenecks, latency, and multi-stage hierarchical layering. Beyond the potential of optical switching and labeling, this paper thoroughly investigates other critical issues related to the choice of a switching architecture layout: its implementation technology; its elasticity, in terms of flexible bandwidth (BW) provisioning; its control plane, lying on top of the data infrastructure plane and providing cognition, control, and orchestration over its network elements; and the choice of optical labeling techniques, in conjunction with current advanced coherent multi-level modulation formats, for the ultrahigh link capacity and packet switching speed demands of scalable, big-data-interconnected DC and HPC environments.
- Research Article
4
- 10.1371/journal.pone.0188428
- Nov 21, 2017
- PloS one
As energy consumption has been surging in an unsustainable way, it is important to understand the impact of existing architecture designs from an energy efficiency perspective. This is especially valuable for High Performance Computing (HPC) and datacenter environments hosting tens of thousands of servers. One obstacle hindering comprehensive evaluation of energy efficiency is the deficiency of power measuring approaches. Most energy studies rely on either external power meters or power models, both of which have intrinsic drawbacks in practical adoption and measuring accuracy. Fortunately, the advent of the Intel Running Average Power Limit (RAPL) interfaces has taken power measurement to the next level, with higher accuracy and finer time resolution. We therefore argue that now is the time to conduct an in-depth evaluation of existing architecture designs to understand their impact on system energy efficiency. In this paper, we leverage representative benchmark suites, including serial and parallel workloads from diverse domains, to evaluate architecture features such as Non-Uniform Memory Access (NUMA), Simultaneous Multithreading (SMT), and Turbo Boost. Energy is tracked at the subcomponent level, covering Central Processing Unit (CPU) cores, uncore components, and Dynamic Random-Access Memory (DRAM), by exploiting the power measurement capability exposed by RAPL.
The experiments reveal non-intuitive results: 1) the mismatch between local compute and remote memory nodes caused by the NUMA effect not only generates a dramatic power and energy surge but also deteriorates energy efficiency significantly; 2) for multithreaded applications such as the Princeton Application Repository for Shared-Memory Computers (PARSEC), most workloads see a notable increase in energy efficiency with SMT, with more than a 40% decline in average power consumption; 3) Turbo Boost is effective at accelerating workload execution and thereby saving energy, but it may not be applicable on systems with tight power budgets.
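RAPL exposes a cumulative energy counter (in microjoules) that periodically wraps, so deriving average power is a wrap-safe delta divided by elapsed time. The sketch below models only that arithmetic; actual counter widths and sysfs paths vary by hardware, and the 32-bit range used here is an assumption for illustration.

```python
# Average power between two RAPL energy-counter readings, handling the
# counter wraparound with modular arithmetic.

def avg_power_watts(e0_uj, e1_uj, seconds, max_range_uj):
    """Average power (W) from cumulative microjoule readings e0 -> e1."""
    delta_uj = (e1_uj - e0_uj) % max_range_uj  # wrap-safe energy delta
    return delta_uj / 1e6 / seconds


# 10 J consumed over 2 s -> 5 W; a wrapped counter gives the same answer.
print(avg_power_watts(0, 10_000_000, 2.0, 2**32))                 # 5.0
print(avg_power_watts(2**32 - 5_000_000, 5_000_000, 2.0, 2**32))  # 5.0
```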
- Research Article
6
- 10.1109/access.2021.3069991
- Jan 1, 2021
- IEEE Access
Linux is becoming the de-facto standard operating system for today's high-performance computing (HPC) systems because it can satisfy the demands of many HPC systems for rich operating system (OS) features. However, owing to features intended for a general-purpose OS, Linux has many OS noise sources, such as page faults or thread migrations, that can result in unstable performance of HPC applications. Furthermore, in the case of the non-uniform memory access (NUMA) architecture, which has different memory access latencies to local and remote memory nodes, application performance instability can be further exacerbated by OS noise. In this paper, we address the OS noise caused by Linux on the NUMA architecture and propose a novel performance-stable NUMA management scheme called Stable-NUMA. Stable-NUMA comprises three techniques for improving performance stability: two-level thread clustering, state-based page placement, and selective page profiling. Our proposed Stable-NUMA scheme significantly alleviates OS noise and enhances the local memory access ratio of the NUMA system compared to Linux. We implemented Stable-NUMA in Linux and experimented with various HPC workloads. The evaluation results demonstrated that Stable-NUMA outperforms Linux, with and without its NUMA-aware feature, by up to 25% in terms of average performance and 73% in terms of performance stability.
- Research Article
64
- 10.1007/s10723-012-9219-2
- Jul 27, 2012
- Journal of Grid Computing
Virtualized datacenters and clouds are being increasingly considered for traditional High-Performance Computing (HPC) workloads that have typically targeted Grids and conventional HPC platforms. However, maximizing energy efficiency and utilization of datacenter resources, and minimizing undesired thermal behavior, while ensuring application performance and other Quality of Service (QoS) guarantees for HPC applications requires careful consideration of important and extremely challenging tradeoffs. Virtual Machine (VM) migration is one of the most common techniques used to alleviate thermal anomalies (i.e., hotspots) in cloud datacenter servers, as it reduces load and, hence, server utilization. In this article, the benefits for thermal management of other techniques, such as voltage scaling and pinning (traditionally used for reducing energy consumption), over VM migrations are studied in detail. As no single technique is the most efficient at meeting temperature/performance optimization goals in all situations, an autonomic approach that performs energy-efficient thermal management while ensuring the QoS delivered to users is proposed. To address the problem of VM allocation that arises during VM migrations, an innovative application-centric energy-aware strategy for VM allocation is proposed. The proposed strategy ensures high resource utilization and energy efficiency through VM consolidation while satisfying application QoS, by exploiting knowledge obtained through application profiling along multiple dimensions (CPU, memory, and network bandwidth utilization). To support our arguments, we present results obtained from an experimental evaluation on real hardware using HPC workloads under different scenarios.