Abstract

Deep Neural Network (DNN) models offer powerful solutions for a wide range of tasks, but the cost of developing such models is nontrivial, which calls for effective model protection. Although black-box distribution can mitigate some threats, model functionality can still be stolen via black-box surrogate attacks. Recent studies have shown that surrogate attacks can be launched in several ways, whereas existing defense methods commonly assume attackers with insufficient in-distribution (ID) data and restricted attack strategies. In this paper, we relax these constraints and assume a practical threat model in which the adversary not only has sufficient ID data and an ample query budget but can also adjust the surrogate training data labeled by the victim model. We then propose a two-step categorical inference poisoning (CIP) framework, featuring both poisoning for performance degradation (PPD) and poisoning for backdooring (PBD). In the first poisoning step, incoming queries are classified into ID and out-of-distribution (OOD) ones using an energy score (ES) based OOD detector; the OOD queries are further divided into high-ES and low-ES ones, which are passed to a strong and a weak PPD process, respectively. In the second poisoning step, difficult ID queries are detected by a proposed reliability score (RS) measurement and are passed to PBD. In doing so, the first-step OOD poisoning causes substantial performance degradation in surrogate models, and the second-step ID poisoning further embeds backdoors in them, while both preserve model fidelity. Extensive experiments confirm that CIP not only achieves promising performance against state-of-the-art black-box surrogate attacks such as KnockoffNets and data-free model extraction (DFME) but also works well against stronger attacks with sufficient ID and deceptive data, outperforming the existing dynamic adversarial watermarking (DAWN) and deceptive perturbation defense methods. PyTorch code is available at https://github.com/Hatins/CIP_master.git.
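
For reference, a minimal PyTorch sketch of the energy-score computation commonly used for OOD detection (following the standard formulation E(x) = -T · logsumexp(f(x)/T)) is given below. The thresholds, the routing function, and the sign convention for "high ES" versus "low ES" are illustrative assumptions, not the values or exact definitions used in the CIP paper.

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Standard energy score: E(x) = -T * logsumexp(f(x)/T).
    In the common convention, ID inputs tend to receive lower (more negative) energy."""
    return -T * torch.logsumexp(logits / T, dim=-1)

def route_query(logits: torch.Tensor, tau_id: float = -8.0, tau_es: float = -5.0,
                T: float = 1.0) -> str:
    """Hypothetical routing sketch: tau_id and tau_es are assumed thresholds,
    not values from the paper."""
    e = energy_score(logits, T).item()
    if e < tau_id:
        return "ID"            # would be forwarded to the second (RS/PBD) step
    return "OOD_high_ES" if e >= tau_es else "OOD_low_ES"
```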
