Polynomial multiplication on embedded vector architectures

Jose Maria Bermudo Mera, Joseph Yiu, Angshuman Karmakar, Hanno Becker, Ingrid Verbauwhede

doi:10.46586/tches.v2022.i1.482-505

Open Access

https://doi.org/10.46586/tches.v2022.i1.482-505

Abstract

High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice-based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms are an active area of research, and this article contributes to this line of work. Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those improvements on the Arm® Cortex®-M4 CPU, as well as the newer Cortex-M55 processor, the first M-profile core implementing the M-profile Vector Extension (MVE), also known as Arm® Helium™ technology. We also implement the Number Theoretic Transform (NTT) on the Cortex-M55 processor. We show that despite being single-issue, in-order and offering only 8 vector registers compared to 32 on A-profile SIMD architectures like Arm® Neon™ technology and the Scalable Vector Extension (SVE), by careful register management and instruction scheduling, we can obtain a 3× to 5× performance improvement over already highly optimized implementations on Cortex-M4, while maintaining the low area and energy profile necessary for use in the embedded market. Finally, as a real-world application, we integrate our multiplication techniques into the post-quantum key-encapsulation mechanism Saber.

Highlights

The rapidly expanding Internet of Things (IoT) has an unprecedented impact on our digital ecosystem, so much so that it is often termed the fourth industrial revolution.
We report an ≈ 5× speedup of our striding Toom-Cook/Karatsuba implementation compared to previous Cortex-M4 implementations, and an ≈ 3.5× speedup of our implementation of the Number Theoretic Transform (NTT) compared to the fastest NTT on Cortex-M4 [CHK+21].
Since batching tends to use a large number of vector registers but a low number of general-purpose registers (GPRs), we found it unsuitable for use with the Helium vector extension except for very small computations.

Summary

Introduction

The rapidly expanding Internet of Things (IoT) has an unprecedented impact on our digital ecosystem, so much so that it is often termed the fourth industrial revolution. The dichotomy of the situation is that tight resource constraints need to be imposed on these devices to allow them to be cost-effective, but they limit performance and, importantly, often impede incorporating secure cryptographic protocols. Both aspects are closely tied because (public-key) cryptography is computationally demanding. The most prominent class of post-quantum cryptography (PQC) is that of schemes based on structured lattices, and the underlying computational workload is the multiplication of polynomials of large degree and low coefficient precision. This problem has been studied extensively, and two approaches have prevailed: multiplication via the Toom-Cook/Karatsuba algorithms [Too63, Coo66, KO62], and multiplication via the Number Theoretic Transform (NTT). We implement the striding Toom-Cook/Karatsuba and a 32-bit degree-256 negacyclic NTT on the Cortex-M55 processor, based on the Helium instruction set. The Cortex-M4 code will be made available on the Saber repository https://github.com/KULeuven-COSIC/SABER.
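To make the Karatsuba side of this comparison concrete, the following is a minimal Python sketch of one-level-recursive Karatsuba multiplication followed by reduction modulo x^n + 1, checked against a schoolbook reference. It is an illustration only and not the authors' striding Toom-Cook/Karatsuba code: the function names are hypothetical, the input length is assumed to be a power of two, and the base-case threshold of 4 is an arbitrary choice.

```python
def schoolbook_negacyclic(a, b, q):
    """Reference product of a and b in Z_q[x]/(x^n + 1), n = len(a) = len(b)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                # x^n = -1, so terms that wrap around are subtracted
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c


def karatsuba(a, b):
    """Unreduced product of two coefficient lists of equal power-of-two length."""
    n = len(a)
    if n <= 4:
        # Base case (threshold chosen arbitrarily): plain schoolbook product
        c = [0] * (2 * n - 1)
        for i in range(n):
            for j in range(n):
                c[i + j] += a[i] * b[j]
        return c
    h = n // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]
    z0 = karatsuba(a0, b0)                      # low * low
    z2 = karatsuba(a1, b1)                      # high * high
    z1 = karatsuba([x + y for x, y in zip(a0, a1)],
                   [x + y for x, y in zip(b0, b1)])
    # Middle term: (a0 + a1)(b0 + b1) - a0*b0 - a1*b1
    mid = [z1[i] - z0[i] - z2[i] for i in range(2 * h - 1)]
    c = [0] * (2 * n - 1)
    for i, v in enumerate(z0):
        c[i] += v
    for i, v in enumerate(mid):
        c[i + h] += v
    for i, v in enumerate(z2):
        c[i + 2 * h] += v
    return c


def karatsuba_negacyclic(a, b, q):
    """Karatsuba product folded back into Z_q[x]/(x^n + 1)."""
    n = len(a)
    c = [0] * n
    for k, v in enumerate(karatsuba(a, b)):
        if k < n:
            c[k] += v
        else:
            c[k - n] -= v
    return [x % q for x in c]
```

Calling `karatsuba_negacyclic(a, b, 1 << 13)` on two length-256 coefficient lists mirrors the polynomial shape used by Saber (degree 256, modulus 2^13). Each recursion level replaces four half-size products with three, at the cost of extra additions and temporary storage, which is why memory efficiency matters on constrained cores.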

Preliminaries

Multiplication by evaluation

Number Theoretic Transform

Modular Arithmetic in Fq

Barrett reduction

Montgomery reduction and multiplication

A primer on vector architectures

Introduction to MVE

Cortex-M55

Memory efficient striding Toom-Cook

Application

Implementation

Schoolbook multiplication

Modular Arithmetic

Forward NTT

Point multiplication for full NTT

Point multiplication for partial NTT

Inverse NTT

Side-Channel resistance

Hashing

Development setup

Polynomial and matrix-vector multiplication

Conclusion


Journal: IACR Transactions on Cryptographic Hardware and Embedded Systems	Publication Date: Nov 19, 2021
Citations: 3	License type: CC BY 4.0


