This article describes an extensive study of the use of DSP48E2 slices in UltraScale FPGAs to design hardware versions of the Montgomery Multiplication algorithm for the hardware acceleration of modular multiplications. Our fully scalable systolic architectures yield a parallelized, DSP48E2-optimized scheduling of operations analogous to the FIOS block variant of Montgomery Multiplication. We explore the impact of different pipelining strategies within DSP blocks, scheduling of operations, processing element configurations, and global design structures, along with their tradeoffs in terms of performance and resource costs. We discuss the application of our methodology to multiple types of DSP primitives. We provide ready-to-use, fast, efficient, and fully parametrizable designs that can adapt to a wide range of requirements and applications. Implementations are scalable to any operand width. Our most efficient designs can perform 128-, 256-, 512-, 1024-, 2048-, and 4096-bit Montgomery modular multiplications in 0.0992 μs, 0.2032 μs, 0.3952 μs, 0.7792 μs, 1.550 μs, and 3.099 μs using 4, 6, 11, 21, 41, and 82 DSP blocks, respectively.
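For readers unfamiliar with the FIOS (Finely Integrated Operand Scanning) variant mentioned above, the following is a minimal software reference sketch of word-serial Montgomery multiplication. It is not the article's hardware architecture: the function name `montgomery_fios`, the helper `to_words`, and the default word width `w=16` are assumptions chosen for illustration, and Python's arbitrary-precision integers stand in for the carry handling that a DSP-based design would have to schedule explicitly.

```python
def to_words(x, w, s):
    """Split integer x into s little-endian words of w bits (hypothetical helper)."""
    mask = (1 << w) - 1
    return [(x >> (w * k)) & mask for k in range(s)]


def montgomery_fios(a, b, n, w=16):
    """Return a * b * R^{-1} mod n, with R = 2^(w*s) and s = ceil(bitlen(n) / w).

    Assumes n is odd and 0 <= a, b < n. The word width w is an illustrative
    choice, not a value taken from the article.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    if not (0 <= a < n and 0 <= b < n):
        raise ValueError("operands must be reduced modulo n")

    mask = (1 << w) - 1
    s = -(-n.bit_length() // w)               # number of w-bit words
    n_prime = (-pow(n, -1, 1 << w)) & mask    # -n^{-1} mod 2^w (Python 3.8+)

    A, B, N = to_words(a, w, s), to_words(b, w, s), to_words(n, w, s)
    t = [0] * (s + 1)                         # running accumulator, s+1 words

    for i in range(s):
        # Word 0: multiplication and reduction are interleaved ("finely integrated").
        acc = t[0] + A[0] * B[i]
        m = ((acc & mask) * n_prime) & mask   # quotient digit for this iteration
        acc += m * N[0]                       # low w bits are now zero
        carry = acc >> w

        # Remaining words: accumulate a_j*b_i + m*n_j and shift down one word.
        for j in range(1, s):
            acc = t[j] + A[j] * B[i] + m * N[j] + carry
            t[j - 1] = acc & mask
            carry = acc >> w

        acc = t[s] + carry
        t[s - 1] = acc & mask
        t[s] = acc >> w

    result = sum(word << (w * k) for k, word in enumerate(t))
    # One conditional subtraction brings the result into [0, n).
    return result - n if result >= n else result
```

The result is in Montgomery form: with `R = 2**(w*s)`, the function returns `a*b*R^{-1} mod n`, so operands are normally converted into and out of the Montgomery domain around a chain of such multiplications.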