Abstract

It is generally accepted that a large-scale quantum computer would be capable of breaking any public-key cryptosystem used today, thereby posing a serious threat to the security of the Internet’s public-key infrastructure. The US National Institute of Standards and Technology (NIST) addresses this threat with an open process for the standardization of quantum-safe key establishment and signature schemes, which is now in the final phase of the evaluation of candidates. SIKE (an abbreviation of Supersingular Isogeny Key Encapsulation) is one of the alternate candidates under evaluation and distinguishes itself from other candidates by its relatively short key lengths and relatively high computing costs. In this paper, we analyze how the latest generation of Intel’s Advanced Vector Extensions (AVX), in particular AVX-512IFMA, can be used to minimize the latency (resp. maximize the throughput) of the SIKE key encapsulation mechanism when executed on Ice Lake CPUs based on the Sunny Cove microarchitecture. We present various techniques to parallelize and speed up the base/extension field arithmetic, point arithmetic, and isogeny computations performed by SIKE. All these parallel processing techniques are combined in AvxSike, a highly optimized implementation of SIKE using Intel AVX-512IFMA instructions. Our experiments indicate that AvxSike instantiated with the SIKEp503 parameter set is approximately 1.5 times faster than the best AVX-512IFMA-based SIKE software in the literature to date. When executed on an Intel Core i3-1005G1 CPU, AvxSike outperforms the x64 assembly implementation of SIKE contained in Microsoft’s SIDHv3.4 library by a factor of about 2.5 for key generation and decapsulation, while the encapsulation is even 3.2 times faster.

Highlights

  • In 2016, the National Institute of Standards and Technology (NIST) became engaged in Post-Quantum Cryptography (PQC) and started an initiative to solicit, evaluate, and standardize quantum-safe public-key cryptographic algorithms [CJL+16]

  • Due to our optimizations for encapsulation, AvxSike-LL reaches a 3.2-fold higher encapsulation speed compared to SIDHv3.4, which can be beneficial for e.g. server-side TLS processing since, when Supersingular Isogeny Key Encapsulation (SIKE) is integrated into TLS, the server has to perform encapsulations

  • By developing sophisticated vector processing techniques for field arithmetic, point arithmetic, and isogeny computations, all of which are integrated into our AvxSike software, we were able to significantly improve both the latency and the throughput of SIKE on modern Intel processors


Summary

Introduction

In 2016, NIST became engaged in Post-Quantum Cryptography (PQC) and started an initiative to solicit, evaluate, and standardize quantum-safe public-key cryptographic algorithms [CJL+16]. There currently exists only one publication dealing with AVX-512 optimizations for SIKE, namely the ARITH 2019 paper of Kostic and Gueron [KG19], but their work focuses solely on the low-level field arithmetic, i.e. they did not explore avenues for parallel processing at the higher levels of SIKE. It is still unknown how AVX-512IFMA can be exploited to unleash the full potential of modern Intel processors for executing SIKE and what latency [...]. At the highest layer, we discuss various approaches for vectorized isogeny computation and key encapsulation. All these parallel processing techniques are combined in AvxSike, an optimized implementation of SIKE using Intel’s AVX-512IFMA instructions. Our latency-optimized AvxSike instantiated with the SIKEp503 parameters is about 1.5 times faster than the AVX-512IFMA-based SIKE software presented in [KG19]. It outperforms Microsoft’s x64 assembly implementation of SIKE by a factor of about 2.5 for both key generation and decapsulation, and even 3.2 for encapsulation, when benchmarked on an Intel Core i3-1005G1 processor.
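
To make the low-level field arithmetic mentioned above more concrete, the following is a minimal Python sketch of a reduced-radix integer multiplication built on a scalar model of the two AVX-512IFMA multiply-add operations (VPMADD52LUQ and VPMADD52HUQ), which add the low or high 52 bits of a 52x52-bit product to a 64-bit accumulator. It is an illustration only and not the AvxSike code: the helper names (`madd52lo`, `madd52hi`, `to_limbs`, `from_limbs`, `mul_limbs`), the choice of ten 51-bit limbs, and the omission of modular reduction are all assumptions made for this sketch, and a single vector lane is modeled instead of eight parallel ones.

```python
MASK51 = (1 << 51) - 1
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo(acc, a, b):
    """Model one 64-bit lane of VPMADD52LUQ: acc + low 52 bits of a*b."""
    prod = (a & MASK52) * (b & MASK52)
    return (acc + (prod & MASK52)) & MASK64

def madd52hi(acc, a, b):
    """Model one 64-bit lane of VPMADD52HUQ: acc + high 52 bits of a*b."""
    prod = (a & MASK52) * (b & MASK52)
    return (acc + (prod >> 52)) & MASK64

def to_limbs(x, n=10):
    """Split x into n limbs of 51 bits each (hypothetical layout, x < 2^(51n))."""
    return [(x >> (51 * k)) & MASK51 for k in range(n)]

def from_limbs(limbs):
    """Recombine (possibly unnormalized) radix-2^51 limbs into an integer."""
    return sum(c << (51 * k) for k, c in enumerate(limbs))

def mul_limbs(a, b):
    """Schoolbook product of two limb vectors; columns are left unreduced.

    The hardware splits each product at bit 52, but the limbs have weight
    2^51, so a*b = lo + hi * 2^52 = lo + (2*hi) * 2^51. The high parts are
    therefore accumulated one column higher and doubled at the end.
    """
    n = len(a)
    lo = [0] * (2 * n)
    hi = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            lo[i + j] = madd52lo(lo[i + j], a[i], b[j])
            hi[i + j + 1] = madd52hi(hi[i + j + 1], a[i], b[j])
    return [l + 2 * h for l, h in zip(lo, hi)]

if __name__ == "__main__":
    x = (1 << 503) - 12345
    y = 0x1234567890ABCDEF << 400
    assert from_limbs(mul_limbs(to_limbs(x), to_limbs(y))) == x * y
```

With ten 51-bit limbs, every column sum stays far below 2^64, so the 64-bit accumulators in this model never overflow; a real implementation would additionally interleave a modular reduction step, which this sketch leaves out.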

Preliminaries
Optimized Isogeny Computations
Intel Advanced Vector Extension AVX-512
Prime-Field Arithmetic
Radix-2^51 Representation
Results and Comparison
Quadratic Extension-Field Arithmetic
Fp2-Multiplication
Fp2-Squaring
Fp2-Addition and Subtraction
Montgomery Elliptic Curve Arithmetic
Three-Point Ladder
Point Doubling and Tripling
Isogeny Generation
Isogeny Evaluation
Low-Latency Implementation
High-Throughput Implementation
Experimental Results
Conclusions
