Abstract

Speaker recognition based on deep learning is currently the most advanced and mainstream technology in the industry. Adversarial attacks, an emerging and powerful class of attacks against neural network models, were first applied in the image domain and have gradually expanded to other domains, posing serious security problems for speaker recognition as well. Common gradient-based attack methods such as FGSM, PGD, and MI-FGSM can deceive speaker recognition models with high confidence, yet their carefully crafted adversarial examples suffer from poor stealthiness and are easily perceived by the human ear. To improve the stealthiness of adversarial examples, this paper proposes a new attack method, the Adaptive Decay Attack (ADA), and applies it to three different speaker recognition scenarios. The method uses a preset number of iterations as its termination condition, automatically adjusts the maximum perturbation according to whether the current attack succeeds, and continuously reduces the step size with learning-rate decay schedules such as exponential decay and cosine annealing. Experimental results show that, on the x-vector and i-vector speaker recognition models, the proposed attack improves stealthiness metrics such as SNR and PESQ by at least 30% and 39%, respectively, over the best PGD attack in untargeted speaker identification. For targeted speaker identification, the average improvement over PGD is at least 20% and 25%, and for speaker verification it is at least 29.5% and 33.4%. In addition, we use this attack method for adversarial training to enhance model robustness. Experimental results show that ADA-based adversarial training takes 28.31% less time than PGD-based adversarial training, and the robustness it confers is generally superior: after training, the attack success rates of the PGD and ADA methods decreased from 50.88% to 36.47% and from 64.74% to 45.82%, respectively.
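The attack loop the abstract outlines can be sketched as follows. This is a minimal, hypothetical illustration only, not the authors' implementation: the function names (`grad_fn`, `is_success_fn`) and the shrink/grow factors for the perturbation budget are assumptions, and the step size here follows a cosine-annealing schedule (one of the two decay schedules the abstract mentions).

```python
import numpy as np

def cosine_annealing(step, total_steps, alpha_max, alpha_min=0.0):
    # Cosine-annealed step size, as in learning-rate schedules:
    # starts at alpha_max and decays smoothly toward alpha_min.
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (
        1.0 + np.cos(np.pi * step / total_steps))

def ada_attack(x, grad_fn, is_success_fn, eps_init=0.05, alpha_max=0.01,
               iters=50, eps_shrink=0.9, eps_grow=1.1):
    """Hypothetical sketch of an Adaptive Decay Attack (ADA) loop.

    A fixed number of iterations is the termination condition. The maximum
    perturbation `eps` is adjusted automatically: when the attack currently
    succeeds, the budget is tightened (to improve stealthiness); when it
    fails, the budget is loosened. The per-step size decays with a cosine
    annealing schedule.
    """
    eps = eps_init
    x_adv = x.copy()
    best = None
    for t in range(iters):
        alpha = cosine_annealing(t, iters, alpha_max)
        # Gradient-sign step, as in FGSM/PGD-style attacks.
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        # Project back into the current eps-ball around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        if is_success_fn(x_adv):
            best = x_adv.copy()   # keep the stealthiest success so far
            eps *= eps_shrink     # success: tighten the perturbation budget
        else:
            eps *= eps_grow       # failure: loosen the perturbation budget
    return best if best is not None else x_adv
```

As a toy usage example, attacking a linear "speaker score" `w @ x` with success defined as pushing the score past a threshold: `ada_attack(np.zeros(4), lambda a: w, lambda a: w @ a > 0.1, eps_init=0.2, alpha_max=0.1, iters=20)` returns a perturbation that crosses the threshold while staying inside the adapted budget.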
