Abstract

High-throughput key encapsulation and decapsulation are needed by IoT applications in order to simultaneously process a multitude of small data items in secure communication. In this paper, we present two novel techniques for accelerating the implementation of polynomial convolution on a GPU, utilizing advanced Tensor cores, which benefit the performance of key encapsulation. First, a polynomial re-structuring technique is proposed to allow several polynomials with distinct public keys to be processed in a single communication cycle, an improvement over the previous work by Lee et al. Next, we observe that polynomial convolution in some key encapsulation mechanisms contains reduction patterns that are not friendly to parallel implementation. We propose separating the multiplication and reduction processes so they can be parallelized independently. To verify the effectiveness of our proposed techniques, we applied them to two key encapsulation mechanisms from the Scabbard post-quantum key-encapsulation mechanism suite and evaluated their performance. Experimental results show that polynomial convolution using Tensor cores is 1.05× faster (for the Florete scheme) and 3.6× faster (for the Sable scheme) than CUDA core-based multiplication using conventional cores on a GPU. The Tensor core-based encapsulation and decapsulation are faster than a reference implementation on a CPU supporting AVX2 by more than 5.6× and 6.4×, respectively, for the Florete scheme, and by 8.3× and 13.3×, respectively, for the Sable scheme. This shows that the proposed techniques can achieve significantly higher throughput for key exchange and encapsulation mechanisms, which are important for securing IoT applications.
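The abstract's second technique, separating polynomial multiplication from modular reduction, can be illustrated with a minimal sketch. The parameters below are assumptions for illustration only (the paper does not give them here): a negacyclic ring Z_q[x]/(x^n + 1) with a power-of-two modulus, as used in Saber-like schemes such as those in the Scabbard suite, and a small n for readability.

```python
# Sketch: separating multiplication from reduction in polynomial convolution.
# Assumptions (illustrative, not taken from the paper): ring Z_q[x]/(x^n + 1),
# power-of-two q; real Saber-like parameters would use e.g. n = 256.

N = 8          # illustrative ring dimension
Q = 1 << 13    # illustrative power-of-two coefficient modulus

def mul_full(a, b):
    """Step 1: plain schoolbook multiplication into 2n-1 coefficients.
    Each output coefficient is an independent sum of products, so this
    step maps naturally onto matrix-multiply hardware such as Tensor cores."""
    c = [0] * (2 * N - 1)
    for i in range(N):
        for j in range(N):
            c[i + j] += a[i] * b[j]
    return c

def reduce_negacyclic(c):
    """Step 2: separate reduction step. Fold modulo x^n + 1 (x^n = -1),
    then reduce each coefficient modulo q. Because this runs after the
    full multiplication, it can be parallelized on its own."""
    r = [0] * N
    for k in range(2 * N - 1):
        if k < N:
            r[k] += c[k]
        else:
            r[k - N] -= c[k]       # x^(n+t) contributes -1 * x^t
    return [x % Q for x in r]

# Example: (1 + 2x) * (3 + x^7) in Z_q[x]/(x^8 + 1).
# The 2x * x^7 term wraps around as -2 * x^0.
a = [1, 2, 0, 0, 0, 0, 0, 0]
b = [3, 0, 0, 0, 0, 0, 0, 1]
result = reduce_negacyclic(mul_full(a, b))
```

A fused loop would interleave the `-1` wrap-around sign into the inner multiplication, creating the divergent, parallel-unfriendly reduction pattern the abstract refers to; splitting the two steps keeps the multiplication a uniform dot-product workload.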
