CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme

Xiangren Chen, Shouyi Yin, Bohan Yang, Shaojun Wei, Leibo Liu

doi:10.46586/tches.v2022.i1.94-126


Open Access

https://doi.org/10.46586/tches.v2022.i1.94-126


Abstract

The number theoretic transform (NTT) is widely used to speed up polynomial multiplication, which is the critical computational bottleneck in many cryptographic algorithms such as lattice-based post-quantum cryptography (PQC) and homomorphic encryption (HE). One trend in NTT hardware architecture is to support diverse security parameters and to meet resource constraints on different computing platforms. Flexibility and area-time product (ATP) therefore become two crucial metrics in NTT hardware design. Flexibility with respect to vector size and modulus can be obtained directly. However, the varying strides in the memory accesses of in-place NTT make it hard to design for different radices and different numbers of parallel butterfly units. This paper proposes an efficient conflict-free memory mapping scheme that supports configuration for both multiple parallel butterfly units and arbitrary NTT radix. Compared to other approaches, this scheme has broader applicability and facilitates the parallelization of non-radix-2 NTT hardware designs. Based on this scheme, we propose a scalable radix-2 and radix-4 NTT multiplication architecture through algorithm-hardware co-design. A dedicated scheduling method reduces the number of modular additions/subtractions and modular multiplications in the radix-4 butterfly unit by 20% and 33%, respectively. To avoid the bit-reversal cost and reduce the memory footprint of arbitrary-radix NTT/INTT, we put forward a general method that rearranges the loop structure and reuses the twiddle factors. The hardware-level optimization exploits the symmetric operators in the radix-4 butterfly unit, which saves almost 50% of the hardware resources compared to a straightforward implementation. Through experimental results and theoretical analysis, we show that, in an interleaved memory system, the radix-4 NTT outperforms the radix-2 NTT in area-time performance for the same number of parallel butterfly units. This advantage grows as the number of parallel butterfly units increases. For example, when processing a 1024-point, 14-bit NTT with 8 parallel butterfly units, the LUT/FF/DSP/BRAM ATP of the radix-4 NTT core is approximately 2.2×/1.2×/1.1×/1.9× lower than that of the radix-2 NTT core on a similar FPGA platform.
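To make the role of the NTT in polynomial multiplication concrete, here is a minimal software sketch. It is not the paper's hardware algorithm: it uses a plain recursive radix-2 transform and a zero-padded cyclic convolution, and the modulus `Q = 12289` and all function names are assumptions chosen for illustration.

```python
# Illustrative sketch only: software NTT-based polynomial multiplication.
# Q and all helper names are assumptions, not taken from the paper.
Q = 12289  # a 14-bit prime with Q - 1 = 3 * 2**12


def find_root(n, q=Q):
    """Return a primitive n-th root of unity mod q (n a power of two >= 2)."""
    assert n >= 2 and (q - 1) % n == 0
    for x in range(2, q):
        w = pow(x, (q - 1) // n, q)
        if pow(w, n // 2, q) == q - 1:  # w^(n/2) = -1, so the order is exactly n
            return w
    raise ValueError("no primitive root of unity found")


def ntt(a, w, q=Q):
    """Recursive radix-2 decimation-in-time NTT; w has order len(a)."""
    n = len(a)
    if n == 1:
        return list(a)
    w2 = w * w % q
    even = ntt(a[0::2], w2, q)
    odd = ntt(a[1::2], w2, q)
    out = [0] * n
    t = 1  # running twiddle factor w^k
    for k in range(n // 2):
        u, v = even[k], t * odd[k] % q
        out[k] = (u + v) % q            # butterfly: sum branch
        out[k + n // 2] = (u - v) % q   # butterfly: difference branch
        t = t * w % q
    return out


def intt(a, w, q=Q):
    """Inverse NTT: forward transform with w^-1, then scale by n^-1."""
    n_inv = pow(len(a), -1, q)
    return [x * n_inv % q for x in ntt(a, pow(w, -1, q), q)]


def poly_mul(f, g, q=Q):
    """Multiply coefficient lists f and g mod q via zero-padded NTTs."""
    size = 2
    while size < len(f) + len(g) - 1:
        size *= 2
    w = find_root(size, q)
    fa = ntt(f + [0] * (size - len(f)), w, q)
    ga = ntt(g + [0] * (size - len(g)), w, q)
    prod = intt([x * y % q for x, y in zip(fa, ga)], w, q)
    return prod[: len(f) + len(g) - 1]


def schoolbook(f, g, q=Q):
    """Quadratic-time reference multiplication used to check poly_mul."""
    out = [0] * (len(f) + len(g) - 1)
    for i, x in enumerate(f):
        for j, y in enumerate(g):
            out[i + j] = (out[i + j] + x * y) % q
    return out
```

For instance, `poly_mul([1, 2, 3], [4, 5])` returns `[4, 13, 22, 15]`, matching (1 + 2x + 3x²)(4 + 5x). Lattice-based schemes typically use a negacyclic variant instead of zero padding; that refinement is omitted here.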

Highlights

In recent years, both the industrial and academic communities have seen a boom in research on post-quantum cryptography and homomorphic encryption
The efficient memory access of in-place number theoretic transform (NTT) is guaranteed by the newly devised conflict-free mapping scheme, which supports configuration for multiple parallel butterfly units and arbitrary radix
In every stage of the NTT computation, if multiple data points are mapped onto the same bank, several read or write operations must be performed on a single bank simultaneously, which results in a so-called spatial conflict. [Joh92] proposes a memory mapping scheme for arbitrary-radix in-place Fast Fourier Transform (FFT) hardware. [XMX17] and [ZYC+20] point out that this scheme is not suitable when multiple butterfly units are placed in every stage
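The spatial conflict described above can be illustrated with a small model. The sketch below assumes a single radix-r butterfly per cycle and r single-port banks, and compares a naive low-order interleaving with a classic digit-sum bank mapping from the in-place FFT literature. It is not the mapping scheme proposed in this paper, which additionally targets multiple parallel butterfly units; all function names are hypothetical.

```python
# Illustrative model only: counts butterflies whose r operands do not fall
# into r distinct banks. Not the paper's mapping scheme.

def digits(a, r, n_digits):
    """Radix-r digits of address a, least significant first."""
    out = []
    for _ in range(n_digits):
        out.append(a % r)
        a //= r
    return out


def bank_naive(a, r, n_digits):
    """Naive interleaving: bank is the lowest radix-r digit."""
    return a % r


def bank_digit_sum(a, r, n_digits):
    """Digit-sum mapping: bank is the sum of all radix-r digits mod r."""
    return sum(digits(a, r, n_digits)) % r


def butterfly_groups(r, n_digits, stage):
    """Yield the r addresses read by each radix-r butterfly at a stage.

    The addresses of one butterfly differ only in digit number `stage`.
    """
    stride = r ** stage
    for base in range(r ** n_digits):
        if (base // stride) % r == 0:
            yield [base + j * stride for j in range(r)]


def count_conflicts(bank_fn, r, n_digits):
    """Number of butterflies, over all stages, with a bank collision."""
    conflicts = 0
    for stage in range(n_digits):
        for group in butterfly_groups(r, n_digits, stage):
            banks = {bank_fn(a, r, n_digits) for a in group}
            if len(banks) < r:
                conflicts += 1
    return conflicts
```

For a 64-point radix-4 transform (`r = 4`, `n_digits = 3`), the naive mapping collides in every butterfly of the two later stages, while the digit-sum mapping has none, because the operands of one butterfly differ in exactly one digit that takes every value from 0 to r - 1.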

Summary

Introduction

Both the industrial and academic communities have seen a boom in research on post-quantum cryptography and homomorphic encryption. [LSW01] and [WHEW14] design a dedicated radix-2 FFT architecture and a radix-16 NTT architecture, respectively, based on this method. [XMX17] and [ZYC+20] identify that this scheme cannot be applied when multiple parallel butterfly units are placed in every stage. [RV08] puts forward an address mapping scheme that supports arbitrary-radix in-place FFT and parallel butterfly units. The efficient memory access of in-place NTT is guaranteed by the newly devised conflict-free mapping scheme, which supports configuration for multiple parallel butterfly units and arbitrary radix. This scheme avoids introducing queues and pipeline stalls, enabling approximately 100% utilization of the pipelined NTT architecture. In contrast to other approaches, this scheme has broader applicability while still maintaining low logic and storage overhead.

NTT-based Multiplication Algorithm

Dimensions of Scalability

Temporal and Spatial Conflicts

DIT Radix-4 NTT with Low Complexity

DIF Radix-4 INTT with Low Complexity


Improvement with Divide and Schedule Method

Avoiding Bit-reversed Cost and Reducing Memory Footprint

Proposed DIT-NR Radix-4 NTT

11: Perform the radix-4 DIT-NR NTT

Proposed DIF-RN Radix-4 INTT

Conflict-free Memory Mapping Scheme

Scalable Iterative NTT Algorithm

5: Then the old address a can be expressed in radix r as:
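The expression that follows this step is not preserved in this extract. As a reminder of the standard form only, assuming N = r^n addresses and writing the radix-r digits of a as a_i (both symbol choices are assumptions here and may differ from the paper's notation):

```latex
a = \sum_{i=0}^{n-1} a_i \, r^{i}, \qquad 0 \le a_i < r
```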

Proposed Solution to Conflict Issue

The Overall Scalable Architecture

Compact Radix-4 Butterfly Unit

Implementation Results and Comparisons

Performance Evaluation Between Radix-4 and Radix-2 NTT

Comparisons with Related Work

Conclusion


Journal: IACR Transactions on Cryptographic Hardware and Embedded Systems
Publication Date: Nov 19, 2021
Citations: 19
License type: CC BY 4.0


Similar Papers

A number theoretic transform accelerator with two parallel simplified butterfly units
Buqing Xu ... Jinjiang Yang
28 Jul 2022

Towards Efficient Hardware Implementation of NTT for Kyber on FPGAs
Cong Zhang ... Quming Jiang
01 May 2021

Area-Efficient Number Theoretic Transform Architecture for Homomorphic Encryption
Phap Duong-Ngoc ... Hanho Lee
IEEE Transactions on Circuits and Systems I: Regular Papers | VOL. 70
01 Mar 2023

Highly Efficient Architecture of NewHope-NIST on FPGA using Low-Complexity NTT/INTT
Neng Zhang ... Chen Chen
IACR Transactions on Cryptographic Hardware and Embedded Systems | VOL. 2020
02 Mar 2020
