Compute-intensive Workloads Research Articles

Arm usage has grown substantially in the High-Performance Computing (HPC) community. The Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022 and currently sits in fourth position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like the European Mont-Blanc and the U.S. DOE/NNSA Astra are further examples of Arm's growing presence in HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have placed a significant bottleneck on the computational side. While most genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extensions (SVE). This paper presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. Overall, the GenArchBench suite comprises 13 multi-core kernels from critical stages of widely used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Our benchmark suite includes two input data sets per kernel (small and large), each with a corresponding regression test that automatically verifies the correctness of each execution. Moreover, the port uses the novel Arm SVE instructions, algorithmic and code optimizations, and Arm-optimized libraries.
We present the optimizations implemented in each kernel, along with a detailed evaluation and comparison of their performance on four different HPC machines (A64FX, Graviton3, Intel Xeon Skylake Platinum, and AMD EPYC Rome). Overall, the experimental evaluation shows that Graviton3 outperforms the other machines on average. Moreover, we observed that the performance of the A64FX is significantly constrained by its small memory hierarchy and latencies. Additionally, as a proof of concept, we study the performance of a production-ready tool that exploits two of the ported and optimized genomic kernels.

Artificial intelligence (AI) and machine learning (ML) have emerged as the fastest-growing workloads, ranging from applications like object detection, natural language processing and facial recognition to self-driving cars. The proliferation of these compute-intensive workloads has resulted in numerous hardware accelerators that aim to fill the gap between the performance and energy-efficiency requirements of AI applications and the capabilities of current architectures like CPUs and GPUs. In most cases these accelerators are specialized for a particular task, are costly to produce, require special programming tools, and can become obsolete as new ML algorithms are introduced. To solve these problems, we present EXTREM-EDGE, a hardware/software co-design approach that adds custom extensions to the open-source RISC-V Instruction Set Architecture (ISA) for designing a scalable and flexible ML processor architecture. EXTREM-EDGE augments the RISC-V processor with hardware AI functional units (AFUs) along with ISA extensions that directly target these AFUs. EXTREM-EDGE is a system-level solution that is easy to program, enables royalty-free production and provides flexibility for future workloads. It enables designers to quickly adapt to any hardware or ISA/software changes and allows design-space exploration of the various available hardware, instruction and software options. The result is a processor architecture that addresses the requirements of current AI/ML workloads, gives the flexibility to hot-swap AFUs when better hardware is available, and scales with new AI instructions in response to rapidly evolving AI algorithms, while providing a streamlined development flow for both hardware and software.
EXTREM-EDGE provides 1.75x (MAC) to 17.63x (PIM VMM) performance improvements for a GEMV kernel and 1.41x (MAC) to 4.41x (PIM VMM) reductions in processor clock cycles for the ResNet-8 neural network model from the MLPerf Tiny benchmark, depending on the size of the added accelerators and the complexity of the added instructions.
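For readers unfamiliar with the GEMV kernel mentioned above, the following is a minimal reference sketch of a general matrix-vector product written as a sequence of scalar multiply-accumulate (MAC) steps. It is an illustration only and is not taken from the EXTREM-EDGE work: the function name `gemv` and the MAC counter are assumptions introduced here, and the sketch says nothing about how the paper's AFUs or ISA extensions are implemented.

```python
def gemv(matrix, vector):
    """Compute y = matrix @ vector with explicit scalar MAC steps.

    Hypothetical reference kernel for illustration; returns the result
    vector together with the number of multiply-accumulate operations.
    """
    cols = len(vector)
    y = []
    mac_count = 0
    for row in matrix:
        if len(row) != cols:
            raise ValueError("row length must match vector length")
        acc = 0
        for a, x in zip(row, vector):
            acc += a * x      # one multiply-accumulate (MAC)
            mac_count += 1
        y.append(acc)
    return y, mac_count


if __name__ == "__main__":
    result, macs = gemv([[1, 2], [3, 4]], [5, 6])
    print(result, macs)
```

An m-by-n GEMV performs m*n MACs in this scalar form, which is the kind of inner loop that a dedicated MAC instruction or a vector-matrix unit would plausibly target.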

Articles published on Compute-intensive Workloads

GenArchBench: A genomics benchmark suite for arm HPC processors

Exploring Instruction Set Architectural Variations: x86, ARM, and RISC-V in Compute-Intensive Applications

Serverless High-Performance Computing over Cloud

EXTREM-EDGE—EXtensions To RISC-V for Energy-efficient ML inference at the EDGE of IoT

An Energy-Efficient 3D Cross-Ring Accelerator With 3D-SRAM Cubes for Hybrid Deep Neural Networks

Enhancing Performance and Energy Efficiency for Hybrid Workloads in Virtualized Cloud Environment

Energy-Efficient Accelerator Design With Tile-Based Row-Independent Compressed Memory for Sparse Compressed Convolutional Neural Networks

Implementing Practical DNN-Based Object Detection Offloading Decision for Maximizing Detection Performance of Mobile Edge Devices

Development of benchmark automation suite and evaluation of various high-performance computing systems

ASIC clouds

Priority-Based PCIe Scheduling for Multi-Tenant Multi-GPU Systems

Dynamic multi-user computation offloading for wireless powered mobile edge computing

Contention-Aware Fair Scheduling for Asymmetric Single-ISA Multicore Systems

Morpheus

A Quantitative Evaluation of Contemporary GPU Simulation Methodology

A Holistic Approach for Collaborative Workload Execution in Volunteer Clouds

Replicated Computations Results (RCR) Report for “A Holistic Approach for Collaborative Workload Execution in Volunteer Clouds”

Moonwalk
