Abstract

Time-efficient artificial intelligence (AI) services have recently attracted growing interest from academia and industry, driven by the urgent demands of large-scale smart applications such as self-driving cars, virtual reality, and high-resolution video streaming. Existing approaches to reducing AI latency, such as edge computing and heterogeneous neural-network accelerators (NNAs), carry a high risk of privacy leakage. To achieve both low latency and privacy preservation on edge servers (e.g., those equipped with NNAs), this paper proposes ENIGMA, which exploits the trusted execution environment (TEE) and heterogeneous NNAs of edge servers for edge inference. Low latency is enabled by a new ahead-of-time analysis framework that analyzes the linearity of multilayer neural networks and automatically slices the forward graph, assigning sub-graphs to the TEE or the NNA. To avoid privacy leakage, we introduce a pre-forwarded cipher generation (PFCG) scheme for computing linear sub-graphs on the NNA: the input is encrypted into a ciphertext that linear sub-graphs can process directly, and the result is decrypted to recover the correct output. To enable non-linear sub-graph computation in the TEE, we employ a ring cache and automatic vectorization to overcome the TEE's memory limitations. Qualitative analysis and quantitative experiments on GPU, NPU, and TPU demonstrate that ENIGMA is not only compatible with heterogeneous NNAs but also avoids leakage of private features, with latency as low as 50 milliseconds.
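The abstract does not spell out PFCG's construction, but the key property it relies on is that a linear sub-graph f satisfies f(x + r) = f(x) + f(r). A minimal sketch of this idea, assuming an additive random mask r generated inside the TEE and a pre-forwarded term f(r) (both names are illustrative, not from the paper):

```python
import numpy as np

# Sketch of the linearity property behind pre-forwarded cipher generation:
# a linear layer f(x) = W @ x computed on the untrusted NNA never sees the
# plaintext input x, only the masked ciphertext x + r. The TEE recovers the
# true output by subtracting the pre-computed f(r).

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # weights of a linear sub-graph (runs on the NNA)
x = rng.standard_normal(8)        # private input feature
r = rng.standard_normal(8)        # random mask generated inside the TEE

cipher = x + r                    # "encrypted" input sent to the NNA
y_cipher = W @ cipher             # NNA computes directly on the ciphertext
y = y_cipher - W @ r              # TEE removes the pre-forwarded term f(r)

assert np.allclose(y, W @ x)      # recovered output equals the plaintext result
```

Non-linear sub-graphs break this identity (f(x + r) ≠ f(x) + f(r) in general), which is why ENIGMA assigns them to the TEE instead.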
