Data-parallel problems demand ever-growing floating-point (FP) operations per second under tight area- and energy-efficiency constraints. In this work, we present Manticore, a general-purpose, ultraefficient chiplet-based architecture for data-parallel FP workloads. We have manufactured a prototype of the chiplet’s computational core in the GlobalFoundries 22FDX process and demonstrate a more than 5x improvement in energy efficiency on FP-intensive workloads compared to CPUs and GPUs. The compute capability at high energy and area efficiency is provided by Snitch clusters (“Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads,” IEEE Trans. Comput.), each containing eight small integer cores, and each core controlling a large floating-point unit (FPU). The core supports two custom ISA extensions. The stream semantic register (SSR) extension elides explicit load and store instructions by encoding them as register reads and writes (“Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores,” IEEE Trans. Comput.). The floating-point repetition (FREP) extension decouples the integer core from the FPU, allowing floating-point instructions to be issued independently. Together, these two extensions allow the single-issue core to minimize its instruction fetch bandwidth and saturate the instruction bandwidth of the FPU, achieving FPU utilization above 90%, with more than 40% of core area dedicated to the FPU.
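To illustrate why eliding loads and stores and decoupling FP instruction issue raises FPU utilization, the following is a minimal back-of-the-envelope sketch in Python, not a model of the actual hardware. It counts idealized issue cycles for an n-element dot product on a single-issue core. The per-iteration instruction mixes, the `setup` overhead, and the function name `fpu_utilization` are all illustrative assumptions, and the model ignores stalls, memory latency, and pipeline effects.

```python
from fractions import Fraction

# ASSUMPTION: hypothetical instruction mix per loop iteration of a dot product.
# Baseline: two FP loads, one fused multiply-add, two pointer bumps, one branch.
BASELINE = {"fld": 2, "fmadd": 1, "addi": 2, "bne": 1}
# With stream registers, operands arrive via register reads, so the explicit
# loads and pointer bumps disappear; a loop counter and branch remain.
SSR = {"fmadd": 1, "addi": 1, "bne": 1}


def fpu_utilization(n, mode, setup=10):
    """Fraction of idealized issue cycles spent on fmadd for n elements.

    `setup` is an assumed fixed cost (in instructions) for configuring the
    streams and, where used, the repetition instruction.
    """
    if mode == "baseline":
        cycles = n * sum(BASELINE.values())
    elif mode == "ssr":
        cycles = setup + n * sum(SSR.values())
    elif mode == "ssr+frep":
        # The repeated fmadd is issued to the FPU every cycle without the
        # integer core fetching loop-control instructions.
        cycles = setup + n
    else:
        raise ValueError(f"unknown mode: {mode}")
    return Fraction(n, cycles)


if __name__ == "__main__":
    for mode in ("baseline", "ssr", "ssr+frep"):
        print(mode, float(fpu_utilization(1000, mode)))
```

Under these assumptions, utilization is bounded by the share of FP instructions in the loop body until loop control is also removed from the issue stream, at which point only the fixed setup cost keeps it below 100%.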