Abstract

Applications such as public-key cryptography are critically reliant on the speed of modular multiplication for their performance. This paper introduces a new block-based variant of Montgomery multiplication, the Block Product Scanning (BPS) method, which is particularly efficient using new 512-bit advanced vector instructions (AVX-512) on modern Intel processor families. Our parallel-multiplication approach also allows for squaring and sub-quadratic Karatsuba enhancements. We demonstrate 1.9,times improvement in decryption throughput in comparison with OpenSSL and 1.5,times improvement in modular exponentiation throughput compared to GMP-6.1.2 on an Intel Xeon CPU. In addition, we show 1.4,times improvement in decryption throughput in comparison with state-of-the-art vector implementations on many-core Knights Landing Xeon Phi hardware. Finally, we show how interleaving Chinese remainder theorem-based RSA calculations within our parallel BPS technique halves decryption latency while providing protection against fault-injection attacks.

Highlights

  • Modular multiplication of large integers is the computational backbone of many applications: public-key cryptographic schemes such as RSA [26], factorization algorithms such as the elliptic curve method (ECM) [17], or even computations of massively remote hexadecimal digits of π [28]

  • The RSA public-key algorithm is the focus of this paper because (1) increasingly large integers are required for good security and faster implementations of the algorithm are of continuing interest [16]; (2) the commonly used Chinese remainder theorem (CRT) approach to decryption is vulnerable to fault-injection attacks for which countermeasures incur overhead [1]; and (3) the continuing widespread use of RSA in practice

  • The contributions of this paper are (1) the introduction of the parallel Block Product Scanning (BPS) Montgomery multiplication variant, which, as implemented using AVX-512, demonstrates 1.9 times increased decryption throughput compared to OpenSSL and 1.5 × increased modular exponentiation throughput compared to GMP-6.1.2 at 2048-bit modulus sizes and (2) a CRT-RSA implementation strategy whereby exponentiations modulo the two prime factors of the RSA modulus, p and q, are performed simultaneously and interleaved within AVX512 vectors

Read more

Summary

Introduction

Modular multiplication of large integers is the computational backbone of many applications: public-key cryptographic schemes such as RSA [26], factorization algorithms such as the elliptic curve method (ECM) [17], or even computations of massively remote hexadecimal digits of π [28]. We show that the two independent exponentiations associated with CRT-RSA decryptions can be computed simultaneously by interleaving data within the lanes of AVX-512 registers. This intermixing of data in our implementation provides protection against fault-injection attacks with no additional calculation overhead. The contributions of this paper are (1) the introduction of the parallel BPS Montgomery multiplication variant, which, as implemented using AVX-512, demonstrates 1.9 times increased decryption throughput compared to OpenSSL and 1.5 × increased modular exponentiation throughput compared to GMP-6.1.2 at 2048-bit modulus sizes and (2) a CRT-RSA implementation strategy whereby exponentiations modulo the two prime factors of the RSA modulus, p and q, are performed simultaneously and interleaved within AVX512 vectors.

Related work
Block Product Scanning technique
Implementation
BPS multiplication
BPS squaring
CRT-RSA implementation
Benchmarks and comparisons
Method
Findings
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call