Parallel modular multiplication using 512-bit advanced vector instructions

Benjamin Buhrow, Clifton Haider, Barry Gilbert

doi:10.1007/s13389-021-00256-9

Abstract

Applications such as public-key cryptography rely critically on the speed of modular multiplication for their performance. This paper introduces a new block-based variant of Montgomery multiplication, the Block Product Scanning (BPS) method, which is particularly efficient using new 512-bit advanced vector instructions (AVX-512) on modern Intel processor families. Our parallel-multiplication approach also allows for squaring and sub-quadratic Karatsuba enhancements. We demonstrate 1.9× improvement in decryption throughput in comparison with OpenSSL and 1.5× improvement in modular exponentiation throughput compared to GMP-6.1.2 on an Intel Xeon CPU. In addition, we show 1.4× improvement in decryption throughput in comparison with state-of-the-art vector implementations on many-core Knights Landing Xeon Phi hardware. Finally, we show how interleaving Chinese remainder theorem-based RSA calculations within our parallel BPS technique halves decryption latency while providing protection against fault-injection attacks.
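As background for the Montgomery multiplication that BPS builds on, the following is a minimal Python sketch of textbook Montgomery reduction (REDC) and a square-and-multiply exponentiation that uses it. It is not the BPS method and uses no vector instructions. The function names (`montgomery_setup`, `redc`, `mont_mul`, `mont_pow`) and the choice of R = 2^bits are illustrative assumptions, not taken from the paper.

```python
# Textbook Montgomery arithmetic on Python integers (requires Python 3.8+
# for pow(n, -1, r)). Illustrative only: not the paper's BPS variant and
# not constant-time.

def montgomery_setup(n, bits):
    """Precompute constants for an odd modulus n with R = 2**bits > n."""
    if n % 2 == 0 or n >= (1 << bits):
        raise ValueError("n must be odd and smaller than R = 2**bits")
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R)
    r2 = (r * r) % n                 # R^2 mod n, used to enter Montgomery form
    return n_prime, r2

def redc(t, n, bits, n_prime):
    """Return t * R^-1 mod n for 0 <= t < n * R, without dividing by n."""
    mask = (1 << bits) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> bits             # t + m*n is an exact multiple of R
    return u - n if u >= n else u       # u < 2n, so one subtraction suffices

def mont_mul(a_bar, b_bar, n, bits, n_prime):
    """Multiply two values that are already in Montgomery form."""
    return redc(a_bar * b_bar, n, bits, n_prime)

def mont_pow(base, exp, n, bits):
    """Left-to-right square-and-multiply using Montgomery multiplication."""
    n_prime, r2 = montgomery_setup(n, bits)
    x_bar = redc((base % n) * r2, n, bits, n_prime)  # base * R mod n
    acc = redc(r2, n, bits, n_prime)                 # 1 * R mod n
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, bits, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x_bar, n, bits, n_prime)
    return redc(acc, n, bits, n_prime)               # leave Montgomery form
```

For example, `mont_pow(3, 65537, 2**61 - 1, 64)` should agree with Python's built-in `pow(3, 65537, 2**61 - 1)`. A production implementation would split the operands into machine words and interleave multiplication with reduction; that word-level scheduling is where variants such as BPS differ.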

Highlights

Modular multiplication of large integers is the computational backbone of many applications: public-key cryptographic schemes such as RSA [26], factorization algorithms such as the elliptic curve method (ECM) [17], or even computations of massively remote hexadecimal digits of π [28].
The RSA public-key algorithm is the focus of this paper because (1) increasingly large integers are required for good security and faster implementations of the algorithm are of continuing interest [16]; (2) the commonly used Chinese remainder theorem (CRT) approach to decryption is vulnerable to fault-injection attacks for which countermeasures incur overhead [1]; and (3) RSA remains in widespread use in practice.
The contributions of this paper are (1) the introduction of the parallel Block Product Scanning (BPS) Montgomery multiplication variant, which, as implemented using AVX-512, demonstrates 1.9× increased decryption throughput compared to OpenSSL and 1.5× increased modular exponentiation throughput compared to GMP-6.1.2 at 2048-bit modulus sizes and (2) a CRT-RSA implementation strategy whereby exponentiations modulo the two prime factors of the RSA modulus, p and q, are performed simultaneously and interleaved within AVX-512 vectors.

Summary

Introduction

Modular multiplication of large integers is the computational backbone of many applications: public-key cryptographic schemes such as RSA [26], factorization algorithms such as the elliptic curve method (ECM) [17], or even computations of massively remote hexadecimal digits of π [28]. We show that the two independent exponentiations associated with CRT-RSA decryptions can be computed simultaneously by interleaving data within the lanes of AVX-512 registers. This intermixing of data in our implementation provides protection against fault-injection attacks with no additional calculation overhead. The contributions of this paper are (1) the introduction of the parallel BPS Montgomery multiplication variant, which, as implemented using AVX-512, demonstrates 1.9× increased decryption throughput compared to OpenSSL and 1.5× increased modular exponentiation throughput compared to GMP-6.1.2 at 2048-bit modulus sizes and (2) a CRT-RSA implementation strategy whereby exponentiations modulo the two prime factors of the RSA modulus, p and q, are performed simultaneously and interleaved within AVX-512 vectors.
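To make the CRT-RSA structure concrete, here is a minimal Python sketch in which the two half-size exponentiations modulo p and q advance in lockstep inside one loop. This is only a loose scalar analogy to interleaving within vector lanes. The function name `crt_rsa_decrypt` is hypothetical, and the final re-encryption check is the classical fault countermeasure, which does incur overhead. It is included for illustration and is not the protection mechanism described in the paper.

```python
# CRT-RSA decryption sketch with both exponentiations stepped together.
# Illustrative only: not constant-time, no padding, toy-sized parameters.

def crt_rsa_decrypt(c, p, q, d, e):
    """Decrypt c with private exponent d using the CRT factors p and q."""
    n = p * q
    dp, dq = d % (p - 1), d % (q - 1)   # reduced private exponents
    q_inv = pow(q, -1, p)               # q^-1 mod p (Python 3.8+)

    cp, cq = c % p, c % q
    mp, mq = 1, 1
    nbits = max(dp.bit_length(), dq.bit_length())
    # One loop drives both exponentiations, one exponent bit per iteration.
    for i in reversed(range(nbits)):
        mp, mq = (mp * mp) % p, (mq * mq) % q
        if (dp >> i) & 1:
            mp = (mp * cp) % p
        if (dq >> i) & 1:
            mq = (mq * cq) % q

    # Garner recombination: m = mq + q * (q_inv * (mp - mq) mod p)
    h = (q_inv * (mp - mq)) % p
    m = mq + h * q

    # Classical countermeasure (assumption, not the paper's approach):
    # re-encrypt and compare, so a faulty half-result is not released.
    if pow(m, e, n) != c % n:
        raise ValueError("fault detected during CRT-RSA decryption")
    return m
```

With the toy parameters p = 61, q = 53, e = 17, d = 2753, encrypting a message as `c = pow(65, 17, 3233)` and calling `crt_rsa_decrypt(c, 61, 53, 2753, 17)` recovers 65. Each exponentiation works on operands half the size of the modulus, which is why running the two side by side is attractive for latency.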

Related work

Block Product Scanning technique

Implementation

BPS multiplication

BPS squaring

CRT-RSA implementation

Benchmarks and comparisons

Method

Findings

Conclusion

Journal: Journal of Cryptographic Engineering
Publication Date: Feb 13, 2021
Citations: 2
License type: open-access
