Abstract

Backdoor attacks aim to compromise clean models without arousing suspicion: poisoned models behave normally on clean inputs yet return adversary-desired results whenever a trigger appears. Because such attacks are highly insidious and hazardous, backdoor defences have attracted considerable attention in the machine learning security community. Unlike most backdoor mitigation defences, our defence aims to determine whether a classifier's prediction is trustworthy; more specifically, we scrutinize whether the prediction is driven by an adversary-defined trigger or by the semantic content of the input. To accomplish this goal, we devise a novel algorithm named feature aggregation, which requires only benign inputs and separates the feature-representation distribution of poisoned inputs from that of benign ones: it minimizes the distance between benign feature representations while maximizing the distance between benign and poisoned feature representations. We then employ flow-based probability density estimation to model the distribution of benign feature representations. Since the likelihood of a poisoned input under the estimated distribution is significantly smaller than that of a benign one, poisoned inputs can be identified with an adaptive threshold. Experimental results show that our method outperforms state-of-the-art defences.
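To make the detection step concrete, the sketch below shows one way the flow-based density estimation and threshold test could look in code. It is an illustrative assumption, not the authors' implementation: it presumes benign feature representations (e.g., penultimate-layer activations of the classifier trained with feature aggregation) are already available as a 2-D tensor with an even feature dimension, uses a small RealNVP-style coupling flow as the density estimator, and substitutes a simple low quantile of benign log-likelihoods for the paper's adaptive threshold. All class and function names are hypothetical, and the feature-aggregation objective itself is not reproduced here.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """RealNVP-style coupling layer: rescales one half of the features conditioned on the other."""

    def __init__(self, dim, hidden=64, flip=False):
        super().__init__()
        assert dim % 2 == 0, "sketch assumes an even feature dimension"
        self.flip = flip
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # predicts log-scale and shift for the other half
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        if self.flip:
            xa, xb = xb, xa
        s, t = self.net(xa).chunk(2, dim=1)
        s = torch.tanh(s)                      # bound the log-scale for numerical stability
        yb = xb * torch.exp(s) + t
        y = torch.cat([yb, xa], dim=1) if self.flip else torch.cat([xa, yb], dim=1)
        return y, s.sum(dim=1)                 # transformed features, log|det Jacobian|


class BenignFeatureFlow(nn.Module):
    """Stack of coupling layers mapping feature vectors to a standard-normal base distribution."""

    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=(i % 2 == 1)) for i in range(n_layers)]
        )
        self.base = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, x):
        z, log_det = x, x.new_zeros(x.shape[0])
        for layer in self.layers:
            z, ld = layer(z)
            log_det = log_det + ld
        return self.base.log_prob(z).sum(dim=1) + log_det


def fit_flow(benign_feats, n_layers=4, epochs=500, lr=1e-3):
    """Maximum-likelihood training of the flow on benign feature representations only."""
    flow = BenignFeatureFlow(benign_feats.shape[1], n_layers)
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = -flow.log_prob(benign_feats).mean()
        loss.backward()
        opt.step()
    return flow


@torch.no_grad()
def flag_suspicious(flow, benign_feats, test_feats, quantile=0.01):
    """Flag inputs whose log-likelihood falls below a low quantile of benign log-likelihoods
    (a crude stand-in for the paper's adaptive threshold)."""
    threshold = torch.quantile(flow.log_prob(benign_feats), quantile)
    return flow.log_prob(test_feats) < threshold  # True => prediction likely trigger-driven
```

In use, one would pass the benign features to fit_flow and call flag_suspicious on the features of incoming inputs; the layer count, the 0.01 quantile, and the choice of penultimate-layer activations are placeholder assumptions rather than values taken from the paper.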
