Abstract

Reduced distinguishability between benign and malicious queries significantly challenges the detection of Model Stealing (MS). Existing methods for identifying MS attacks exhibit key limitations: (1) sample-level detection methods that rely on fixed feature thresholds often inadvertently flag benign samples or overlook malicious ones; (2) distribution-level detection methods with static divergence benchmarks may misclassify benign query samples that deviate from those benchmarks. This paper introduces GuardNet, an innovative model stealing detection method. By combining boundary features with inter-sample distance features, GuardNet identifies malicious sample pairs more precisely, and it employs distribution divergences to adjust decision thresholds, thus enhancing its detection capability. The method incorporates a variational autoencoder to reconstruct query samples and uses the Wasserstein distance between pre- and post-reconstruction samples as a measure of distribution divergence, effectively minimizing the influence of distribution shifts on benign query samples. Experimental results indicate that this approach significantly increases the number of queries an attack requires before detection while markedly decreasing false positives.
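The divergence measure described above can be illustrated with a minimal sketch. The snippet below is not GuardNet's implementation; it assumes a hypothetical reconstruction function standing in for a trained variational autoencoder, and computes the per-dimension 1-D Wasserstein distance between queries before and after reconstruction as a divergence score:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def reconstruction_divergence(queries, reconstruct):
    """Mean 1-D Wasserstein distance between a query batch and its
    reconstruction, averaged over feature dimensions (illustrative only)."""
    recon = reconstruct(queries)
    return float(np.mean([
        wasserstein_distance(queries[:, d], recon[:, d])
        for d in range(queries.shape[1])
    ]))

# Toy stand-in for a trained VAE: clipping reconstructs in-distribution
# queries closely but distorts queries far from the training distribution.
fake_vae = lambda x: np.clip(x, -2.0, 2.0)

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(256, 8))    # matches "training" range
shifted = rng.normal(3.0, 1.0, size=(256, 8))   # out-of-distribution queries

d_benign = reconstruction_divergence(benign, fake_vae)
d_shifted = reconstruction_divergence(shifted, fake_vae)
assert d_shifted > d_benign  # larger divergence for shifted queries
```

In this sketch a higher divergence score would nudge the decision threshold, mirroring the idea that reconstruction error separates query distributions; the actual VAE architecture and thresholding rule are those of the paper, not shown here.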
