The rapid advancement of deep neural networks has sharply increased demands on computation and data volume. This trend is especially evident with the emergence of large language models, which have rendered traditional architectures such as CPUs and GPGPUs unable to meet performance and energy-efficiency requirements. Spatial accelerators offer a promising solution by optimizing on-chip compute, storage, and communication resources. In exploring spatial accelerator design spaces, analytical-model-based simulators and cycle-accurate simulators are commonly employed, each with a distinct advantage: high simulation speed and high simulation accuracy, respectively. However, the limited accuracy of analytical models and the slow speed of cycle-accurate simulators impede reaching globally optimal solutions during design space exploration. Effectively leveraging the strengths of both simulator types while mitigating their respective weaknesses is therefore a critical challenge in designing customized spatial accelerators. In this work, we introduce a novel co-exploration methodology that integrates coarse-grained and fine-grained approaches to navigate the design and mapping spaces effectively. We use the rapid simulation of analytical models to perform coarse-grained global exploration, quickly eliminating designs and mapping configurations with inferior performance. Building on the results of this initial exploration, we then employ cycle-accurate simulators for fine-grained local exploration within the identified promising regions of the design and mapping spaces. This dual-phase approach identifies hardware designs and dataflow mapping strategies that enhance performance and energy efficiency. Experimental results demonstrate that, compared with state-of-the-art methods, our approach reduces the number of exploration points by up to 99% while achieving a 17.9% reduction in latency, a 2.5% decrease in energy consumption, and a 30.3% improvement in throughput.
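To make the dual-phase flow concrete, the sketch below shows one way the coarse-then-fine loop could be structured. It is a minimal illustration under stated assumptions, not the authors' implementation: the DesignPoint fields, the keep_fraction pruning threshold, and the analytical_cost / cycle_accurate_cost callables are hypothetical stand-ins for the paper's analytical model and cycle-accurate simulator.

```python
# Minimal sketch of a two-phase co-exploration loop: a fast analytical
# model prunes the global space, then an expensive cycle-accurate
# simulator re-scores only the surviving candidates. All names here are
# hypothetical stand-ins, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple
import heapq

@dataclass(frozen=True)
class DesignPoint:
    pe_rows: int     # processing-element array rows
    pe_cols: int     # processing-element array columns
    buffer_kb: int   # on-chip buffer size
    dataflow: str    # e.g. "weight-stationary", "output-stationary"

def coarse_phase(candidates: Iterable[DesignPoint],
                 analytical_cost: Callable[[DesignPoint], float],
                 keep_fraction: float = 0.01) -> List[DesignPoint]:
    """Phase 1: score every candidate with the fast analytical model and
    keep only the most promising fraction, discarding the rest."""
    scored = [(analytical_cost(d), d) for d in candidates]
    keep = max(1, int(len(scored) * keep_fraction))
    return [d for _, d in heapq.nsmallest(keep, scored, key=lambda t: t[0])]

def fine_phase(survivors: List[DesignPoint],
               cycle_accurate_cost: Callable[[DesignPoint], float]
               ) -> Tuple[DesignPoint, float]:
    """Phase 2: re-evaluate only the pruned set with the slow but
    accurate simulator and return the best point found."""
    scored = [(cycle_accurate_cost(d), d) for d in survivors]
    cost, best = min(scored, key=lambda t: t[0])
    return best, cost

if __name__ == "__main__":
    # Toy cost models standing in for the real simulators (assumptions).
    def analytical_cost(d: DesignPoint) -> float:
        # Crude latency proxy: fewer PEs and smaller buffers cost more.
        return 1e6 / (d.pe_rows * d.pe_cols) + 1e3 / d.buffer_kb

    def cycle_accurate_cost(d: DesignPoint) -> float:
        # Pretend the accurate simulator penalizes one dataflow slightly.
        penalty = 1.05 if d.dataflow == "output-stationary" else 1.0
        return analytical_cost(d) * penalty

    grid = [DesignPoint(r, c, kb, df)
            for r in (8, 16, 32)
            for c in (8, 16, 32)
            for kb in (64, 128, 256)
            for df in ("weight-stationary", "output-stationary")]

    survivors = coarse_phase(grid, analytical_cost, keep_fraction=0.1)
    best, cost = fine_phase(survivors, cycle_accurate_cost)
    print(f"best design: {best}, estimated cost: {cost:.2f}")
```

A real implementation would also perform local mapping search around each survivor in the fine phase rather than merely re-scoring it; the sketch only captures the prune-then-refine structure in which the expensive simulator is invoked on a small fraction of the space.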