Abstract

In a model inversion attack, an adversary attempts to reconstruct the training data records of a target model using only the model’s output. Contemporary model inversion attacks generally rely on either the predicted confidence score vectors, i.e., black-box attacks, or the parameters of the target model, i.e., white-box attacks. In the real world, however, model owners usually release only the predicted labels; the confidence score vectors and model parameters are hidden as a defense mechanism against such attacks. Unfortunately, we have found a model inversion method that can reconstruct representative samples of the target model’s training data from the output labels alone. We believe this attack requires the least information to succeed and therefore has the broadest applicability. The key idea is to exploit the error rate of the target model to compute the median distance from a set of data records to the decision boundary of the target model. This distance is then used to generate confidence score vectors, which are in turn used to train an attack model that reconstructs the representative samples. Experimental results show that highly recognizable representative samples can be reconstructed with far less information than existing methods require.
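To make the label-only setting concrete, the following is a minimal sketch, assuming a toy linear target model and a per-record binary search to estimate the distance to the decision boundary; the paper instead derives a median distance from the target model's error rate, a derivation not reproduced here. The names `predict_label`, `distance_to_boundary`, and `synthetic_confidence`, and the sigmoid calibration, are assumptions of this sketch, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the target model: a linear binary classifier that,
# like a real deployed model, reveals only its predicted label.
w, b = np.array([1.5, -2.0]), 0.3

def predict_label(x):
    return int(x @ w + b > 0)

def distance_to_boundary(x, direction, hi=10.0, tol=1e-4):
    """Estimate the distance from x to the decision boundary along
    `direction` by binary search, using only predicted labels."""
    y0 = predict_label(x)
    if predict_label(x + hi * direction) == y0:
        return None  # boundary not crossed within the search radius
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if predict_label(x + mid * direction) == y0:
            lo = mid
        else:
            hi = mid
    return hi

# Median boundary distance over a set of data records. (The paper derives
# this quantity from the target model's error rate; per-record binary
# search is a simplified stand-in used here for illustration.)
records = rng.normal(size=(200, 2))
dists = []
for x in records:
    direction = rng.normal(size=2)
    direction /= np.linalg.norm(direction)
    d = distance_to_boundary(x, direction)
    if d is not None:
        dists.append(d)
median_dist = float(np.median(dists))

def synthetic_confidence(d, scale=median_dist):
    """Map an estimated boundary distance to a pseudo confidence score
    vector: far from the boundary -> near 1, close to it -> near 0.5.
    The sigmoid calibration is an assumption of this sketch."""
    p = 1.0 / (1.0 + np.exp(-d / scale))
    return np.array([1.0 - p, p])  # scores for the two classes

print(f"median distance ~ {median_dist:.3f}")
print("example synthetic score vector:", synthetic_confidence(dists[0]))
```

In the paper's pipeline, score vectors synthesized in this spirit substitute for the hidden confidences and are fed to an attack model that reconstructs representative samples; the sketch stops at score synthesis.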
