Cycle-accurate Simulator Research Articles

As an open-source instruction set that is flexible in hardware extension, RISC-V is beginning to enter the world of high-performance computing. One distinguishing feature of processing units that adopt RISC-V is the ability to add custom circuits with special-purpose accelerators. As artificial general intelligence becomes practical, AI accelerators are becoming an indispensable part of computing devices, and RISC-V is a great fit for the CPU that glues the accelerators together. A system on chip designed by Alibaba T-Head is one of the early mass-produced chips to adopt a RISC-V CPU. This CPU, named Xuantie-910, is a high-performance design with 128-bit RISC-V vector processing units intended to accelerate AI applications. OpenMC has been adapted to run on Xuantie-910. In the Monte Carlo method for reactor physics, fetching the neutron cross sections is the hotspot that accounts for the majority of the computational burden. Traditional point-wise cross sections are slow because of the memory latency caused by accessing many nonconsecutive memory addresses. An AI model for cross sections is therefore proposed. With a runtime size of 2.2 KB, the smallest in the published work, the data can be held entirely in the L1 cache during on-the-fly cross section evaluation and fetched with a single memory read. The in-house AI model also covers the entire energy range, whereas previous work supports only the resonance range. The effect of memory latency is therefore minimized. The average relative error of the AI-modeled U-238 elastic cross section is 0.6% with respect to the point-wise cross section. With a modified version of OpenMC on an Apple M3 Max, for a VERA pin-cell problem, adopting the AI-modeled cross section reduces the total runtime by 7% compared to the point-wise cross section, although calculating the U-238 elastic cross section itself takes 40% more runtime.
Adopting the AI-modeled U-238 elastic cross section leads to a K-effective that is 302 pcm higher than with point-wise cross sections. The advantage of the AI model has been verified. With the AI-modeled cross section, the neutron slowing-down problem with pure elastic scattering on U-238 has been studied on Xuantie-910. The average relative error in the 65,536 group fluxes is about 0.9% with respect to the point-wise cross section. However, when accelerated with the 128-bit vector processing units, the performance degrades by 35% because of the narrow 64-bit load and store interface to the vector register files. The performance with the AI-modeled cross section is about 1/4 of that with point-wise cross sections. In addition, the 1,024-bit-wide Ara RISC-V vector processing unit has been used to study the cost of AI-modeled cross section evaluation. Because the open-source hardware design is available in SystemVerilog, cycle-accurate circuit simulation is performed. Using the vector processing units, the cost is reduced to 65% of that with scalar instructions. The 128-bit load and store interface to the vector processing units is a major contributor to the speedup. The width of the load and store interface to the vector processing units should be the main optimization factor in chip design for accelerating AI-modeled cross section evaluation.
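As a rough illustration of the point-wise baseline described above, the sketch below looks up a cross section by binary search on a sorted energy grid followed by linear interpolation, and computes an average relative error between two sets of values. It is not taken from OpenMC or from the article: the function names, the linear-linear interpolation choice, and the clamping at the grid edges are assumptions made for illustration only.

```python
from bisect import bisect_right


def pointwise_xs(energy_grid, xs_values, energy):
    """Look up a cross section on a sorted point-wise grid (hypothetical helper).

    The binary search touches scattered grid entries, which is the kind of
    nonconsecutive memory access the abstract identifies as slow.
    """
    # Clamp to the table edges (an assumption, not the article's treatment).
    if energy <= energy_grid[0]:
        return xs_values[0]
    if energy >= energy_grid[-1]:
        return xs_values[-1]
    # Index of the grid interval [e0, e1) that contains the energy.
    i = bisect_right(energy_grid, energy) - 1
    e0, e1 = energy_grid[i], energy_grid[i + 1]
    frac = (energy - e0) / (e1 - e0)
    # Linear-linear interpolation between the two bracketing points.
    return xs_values[i] + frac * (xs_values[i + 1] - xs_values[i])


def average_relative_error(reference, approximate):
    """Mean of |approximate - reference| / |reference| over paired values."""
    total = sum(abs(a - r) / abs(r) for r, a in zip(reference, approximate))
    return total / len(reference)
```

A model-based evaluation would replace `pointwise_xs` with a small function whose parameters fit in the L1 cache; `average_relative_error` could then compare its output against the point-wise values on a set of test energies.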

Networks-on-Chip (NoCs) are the standard on-chip communication fabrics for connecting cores, caches, and memory controllers in multi/many-core systems. With the increase in communication load introduced by emerging parallel computing applications, on-chip communication is becoming more costly than computation in terms of energy consumption. This paper contributes to existing research on approximate communication by proposing a slack-aware packet approximation technique to reduce the energy consumed by NoCs for sustainable parallel computation. The proposed approximation technique lowers both execution time and NoC power consumption by reducing the packet size based on slack. The slack is the number of cycles by which a packet can be delayed in the network with no effect on execution time. Low-slack packets are therefore considered critical to system performance, and prioritizing these packets during transmission significantly reduces execution time. The proposed technique includes a slack-aware control policy that identifies low-slack packets and accelerates them using two packet approximation mechanisms, namely an in-network approximation (INAP) and a network interface approximation (NIAP). The INAP mechanism prioritizes low-slack packets during the arbitration phase of the router by approximating high-slack packets. The NIAP mechanism reduces the latency of network link and switch traversals by truncating data for the low-slack packets. An approximate network interface and router are implemented to support the proposed technique with lightweight packet approximation hardware for lower power consumption and execution time. Cycle-accurate simulations using the AxBench and PARSEC benchmark suites show that the proposed approximate communication technique achieves reductions of up to 24% in execution time and 38% in energy consumption with 1.1% less accuracy loss on average compared to existing approximate communication techniques.
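The slack-based idea can be sketched in a few lines. The example below is a toy model, not the paper's hardware: the `Packet` type, the slack threshold, the number of truncated bits, and the lowest-slack-first arbitration rule are all hypothetical choices used to show how low-slack packets might be identified, favoured at arbitration, and shortened by dropping low-order data bits.

```python
from dataclasses import dataclass, field

SLACK_THRESHOLD = 8   # hypothetical cutoff (cycles) for "low slack"
TRUNCATED_BITS = 8    # hypothetical number of low-order bits dropped per word


@dataclass
class Packet:
    packet_id: int
    slack_cycles: int                              # cycles the packet can be delayed for free
    payload: list = field(default_factory=list)    # 32-bit unsigned data words


def is_low_slack(packet, threshold=SLACK_THRESHOLD):
    """Classify a packet as critical when its slack is below the threshold."""
    return packet.slack_cycles < threshold


def truncate_payload(payload, dropped_bits=TRUNCATED_BITS):
    """Zero the low-order bits of each 32-bit word, modelling the shorter
    data a truncating network interface would send."""
    mask = (0xFFFFFFFF >> dropped_bits) << dropped_bits
    return [word & mask for word in payload]


def arbitrate(candidates):
    """Grant the output port to the candidate with the least slack."""
    return min(candidates, key=lambda p: p.slack_cycles)
```

In this toy model, a router port contended by packets with 2 and 40 cycles of slack grants the first one, and a word such as `0x12345678` is sent as `0x12345600` when 8 bits are truncated.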

Articles published on Cycle-accurate Simulator

Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access

EcoFlow: Efficient Convolutional Dataflows on Low-Power Neural Network Accelerators

Performance and energy evaluation of dynamic adaptive deterministic routing algorithm for multicore architectures

Enhancing Computation-Efficiency of Deep Neural Network Processing on Edge Devices through Serial/Parallel Systolic Computing

Load Balanced PIM-Based Graph Processing

Technological Prerequisites and Consequences of Ubiquitous Computing and Networking in Resurrecting Extinct Computers

On Hardware Flexibility and Heterogeneity: A Vision for Monte Carlo Codes on Incoming RISC-V Computing Devices with AI-based Cross Section

BOOM-Explorer: RISC-V BOOM Microarchitecture Design Space Exploration

Locally-Adaptive Level-of-Detail for Hardware-Accelerated Ray Tracing

Exploring Instruction Set Architectural Variations: x86, ARM, and RISC-V in Compute-Intensive Applications

PH-ORAM: An efficient persistent ORAM design for hybrid memory systems

Fast Performance Analysis for NoCs With Weighted Round-Robin Arbitration and Finite Buffers

CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework

FourierPIM: High-throughput in-memory Fast Fourier Transform and polynomial multiplication

Early DSE and Automatic Generation of Coarse-grained Merged Accelerators

Worst-Case Communication Time Analysis for On-Chip Networks With Finite Buffers

Slack-Aware Packet Approximation for Energy-Efficient Network-on-Chips

A Technique for Approximate Communication in Network-on-Chips for Image Classification

TPPD: Targeted Pseudo Partitioning based Defence for cross-core covert channel attacks

Enabling Reduced Simpoint Size Through LiveCache and Detail Warmup
