TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU

Shengyu Fan,Dan Meng,Zhiwei Wang,Mingzhe Zhang,Rui Hou,Weizhi Xu

doi:10.1109/hpca56546.2023.10071017

Abstract

In the cloud computing era, privacy protection is becoming pervasive in a broad range of applications (e.g., machine learning, data mining, etc). Fully Homomorphic Encryption (FHE) is considered the perfect solution as it enables privacy-preserved computation on untrusted servers. Unfortunately, the prohibitive performance overhead blocks the wide adoption of FHE (about 10, 000× slower than the normal computation). As heterogeneous architectures have gained remarkable success in several fields, achieving high performance for FHE with specifically designed accelerators seems to be a natural choice. Until now, most FHE accelerators have focused on efficiently implementing one FHE operation at a time based on ASIC and with significantly higher performance than GPU and FPGA. However, recent state-of-the-art FHE accelerators rely on an expensive and large on-chip storage and a high-end manufacturing process (i.e., 7nm), which increase the cost of FHE adoption.In this paper, we propose TensorFHE, an FHE acceleration solution based on GPGPU for real applications on encrypted data. TensorFHE utilizes Tensor Core Units (TCUs) to boost the computation of Number Theoretic Transform (NTT), which is the part of FHE with highest time-cost. Moreover, TensorFHE focuses on performing as many FHE operations as possible in a certain time period rather than reducing the latency of one operation. Based on such an idea, TensorFHE introduces operation-level batching to fully utilize the data parallelism in GPGPU. We experimentally prove that it is possible to achieve comparable performance with GPGPU as with state-of-the-art ASIC accelerators. TensorFHE performs 913 KOPS and 88 KOPS for NTT and HMULT (key FHE kernels) within NVIDIA A100 GPGPU, which is 2.61× faster than state-of-the-art FHE implementation on GPGPU; Moreover, TensorFHE provides comparable performance to the ASIC FHE accelerators, which makes it even 2.9× faster than the F1+ with a specific workload. Such a pure software acceleration based on commercial hardware with high performance can open up usage of state-of-the-art FHE algorithms for a broad set of applications in real systems.

Full Text