In the current chip-multiprocessor era, 3D-stacked DRAM has become an attractive means of mitigating the DRAM bandwidth wall. In a chip-multiprocessor, the 3D-stacked DRAM is architected either (a) to cache both local and remote data or (b) to cache only local data. Caching only local data in the 3D-stacked DRAM forces the chip-multiprocessor to incur inter-node latency overhead when accessing remote data; caching both local and remote data, however, requires a large coherence directory (tens of MBs) to ensure correctness. In this paper, we consider a 3D-stacked DRAM based chip-multiprocessor and perform a comparative study between (a) high-level adaptive run-time data page mapping onto the DRAM with a small auxiliary SRAM buffer as a performance booster, and (b) the DRAM used as a coherent cache. Our experiments on a 64-core chip-multiprocessor system with 4GB of 3D-stacked DRAM show that our adaptive run-time data page mapping on DRAM, together with the SRAM buffer, outperforms the base case (where the DRAM caches only local data) by an average of 48%. Moreover, our method shows an average performance improvement of 40% over a recent state-of-the-art work (where the DRAM caches both local and remote data).