Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4

Amin Abdulrahman, Bo-Yin Yang, Matthias J. Kannwischer, Vincent Hwang, Jiun-Peng Chen, Yu-Jia Chen

doi:10.46586/tches.v2022.i1.127-151

Abstract

The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs). Recent work by [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is faster than Toom–Cook multiplication for unprotected Saber implementations on the Cortex-M4. However, it remains unclear whether NTT multiplication can outperform Toom–Cook in masked implementations of Saber. Additionally, it is an open question whether Saber with NTTs can outperform Toom–Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs that outperforms an existing Toom–Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom–Cook implementation while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom–Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that for both stack- and speed-optimization purposes, one should base polynomial multiplications in Saber on the NTT rather than Toom–Cook for the Cortex-M4 and Cortex-M3. In particular, in many cases, multi-moduli NTTs perform best.

Highlights

Shor’s algorithm [Sho97] threatens all widely deployed public-key cryptography as it solves the integer factorization and the discrete logarithm problems on a quantum computer
Saber is based on the module learning with rounding (M-LWR) problem
We show that stack optimization on Cortex-M4 can be applied to the 16-bit number-theoretic transforms (NTTs) approach on Cortex-M3

Summary

Introduction

Shor’s algorithm [Sho97] threatens all widely deployed public-key cryptography as it solves the integer factorization and the discrete logarithm problems on a quantum computer. There is a line of work optimizing Saber for the Cortex-M4 [KRS19, MKV20, CHK+21] using Karatsuba, Toom–Cook, and lately NTTs. A masked Saber is presented by Van Beirendonck et al. in [VBDK+20]. This leaves three open questions. First, can we achieve a smaller memory footprint for Saber with NTTs compared to the [MKV20] stack-optimized Toom–Cook implementation? Second, can NTTs outperform Toom–Cook in masked Saber? The [CHK+21] implementation relies on one of the multiplicands being small and only computes the correct 25-bit result. This is true for the secrets in Saber, but it does not apply to masked implementations in which the secret is arithmetically shared modulo q (e.g., [VBDK+20]). Third, should Saber implementations targeting the Cortex-M3 use NTTs? We point out an overlooked stack optimization with multi-moduli NTTs. The optimization justifies an unconventional use of a composite modulus for unmasked Saber and of unequal-size NTTs for masked Saber, neither of which has been implemented before.
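The "25-bit result" remark can be checked with a short worst-case bound. The sketch below is illustrative arithmetic only, not code from the paper. It assumes the parameters of the Saber specification, which this summary does not restate: degree n = 256, q = 2^13, module rank l and secret range mu per parameter set, public coefficients taken in a centered representation, and secret coefficients in [-mu/2, mu/2]. The helper names are hypothetical.

```python
# Saber parameters as given in the Saber specification (an assumption here;
# the summary above does not list them).
N = 256        # polynomial degree n
Q = 2 ** 13    # modulus q
PARAMS = {     # parameter set: (module rank l, secret range mu)
    "LightSaber": (2, 10),
    "Saber": (3, 8),
    "FireSaber": (4, 6),
}


def inner_product_bound(l, max_a, max_s):
    """Worst-case absolute value of one coefficient of an inner product of
    l polynomial pairs, each a negacyclic product of degree-N polynomials
    whose coefficients are bounded by max_a and max_s in absolute value."""
    return l * N * max_a * max_s


def signed_bits(bound):
    """Smallest two's-complement width holding every value in [-bound, bound]."""
    return bound.bit_length() + 1


half_q = Q // 2

# Unmasked case: the secret is small, |s_i| <= mu / 2.
unmasked = {name: inner_product_bound(l, half_q, mu // 2)
            for name, (l, mu) in PARAMS.items()}

# Masked case: each arithmetic share of the secret is a full-size value mod q.
masked = {name: inner_product_bound(l, half_q, half_q)
          for name, (l, mu) in PARAMS.items()}

for name in PARAMS:
    print(name,
          "unmasked bits:", signed_bits(unmasked[name]),
          "masked bits:", signed_bits(masked[name]))
```

Under these assumptions the unmasked bound stays within a 25-bit signed range for all three parameter sets, so a single NTT modulus that fits in a 32-bit word is enough. With full-size shares the bound exceeds a 32-bit signed range, which is one way to see why the masked case calls for a different approach such as combining NTTs over more than one modulus.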

Preliminaries

ARM Cortex-M4 and Cortex-M3

Montgomery multiplication

Number-Theoretic Transform

Explicit Chinese remainder theorem computations

Explicit formulations for NTTs

Multi-moduli NTTs to save memory

Prior uses of multi-moduli

Polynomial multiplication

Discrete weighted transform

Cooley–Tukey and Gentleman–Sande FFTs

CT for NTT and GS for iNTT

GS for NTT and CT for iNTT

NTT for NTT-unfriendly rings

Incomplete NTT

NTTs for MatrixVectorMul

Reducing stack usage for MatrixVectorMul

Implementation on M4

New record on stack usage for unmasked Saber

Masked MatrixVectorMul for Saber

Implementation on M3

A Note on combining 32-bit and 16-bit

Result

Performance

Findings

Leakage Evaluation of Masked MatrixVectorMul in Saber


Journal: IACR Transactions on Cryptographic Hardware and Embedded Systems
Publication Date: Nov 19, 2021
Citations: 12
License type: CC BY 4.0


