Abstract

Associative processing (AP) is a promising PIM paradigm that overcomes the von Neumann bottleneck (memory wall) by virtue of a radically different execution model. By decomposing arbitrary computations into a sequence of primitive memory operations (i.e., search and write), AP’s execution model supports concurrent SIMD computations in-situ in the memory array to eliminate the need for data movement. This execution model also provides a native support for flexible data types and only requires a minimal modification on the existing memory design (low hardware complexity). Despite these advantages, the execution model of AP has two limitations that substantially increase the execution time, i.e., 1) it can only search a single pattern in one search operation and 2) it needs to perform a write operation after each search operation. In this paper, we propose the Highly Performant Associative Processor (Hyper- AP) to fully address the aforementioned limitations. The core of Hyper- AP is an enhanced execution model that reduces the number of search and write operations needed for computations, thereby reducing the execution time. This execution model is generic and improves the performance for both CMOS-based and RRAM-based AP, but it is more beneficial for the RRAMbased AP due to the substantially reduced write operations. We then provide complete architecture and micro-architecture with several optimizations to efficiently implement Hyper-AP. In order to reduce the programming complexity, we also develop a compilation framework so that users can write C-like programs with several constraints to run applications on Hyper- AP. Several optimizations have been applied in the compilation process to exploit the unique properties of Hyper- AP. Our experimental results show that, compared with the recent work IMP, Hyper- AP achieves up to 54×/4.4× better power-/area-efficiency for various representative arithmetic operations. For the evaluated benchmarks, Hyper-AP achieves 3.3× speedup and 23.8× energy reduction on average compared with IMP. Our evaluation also confirms that the proposed execution model is more beneficial for the RRAM-based AP than its CMOS-based counterpart.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call