Abstract

Speaker recognition (SR) systems play an essential role in many applications, such as speech-controlled access systems and voice verification, where robustness is crucial for everyday use. The vulnerability of SR systems has therefore become an active area of study. Notably, recent studies show that SR systems are susceptible to adversarial attacks, which generate adversarial examples by adding a carefully crafted, inconspicuous perturbation to the original audio so that the target model makes false predictions. However, most existing work conducts attacks in the white-box setting, which has limited practical relevance because model parameters and architectures are usually unavailable in real-world scenarios. To address this, we propose a black-box attack framework that requires no details of the target model. We generate adversarial examples with a gradient-estimation procedure based on the natural evolution strategy (NES); the estimation needs only the confidence scores and decisions produced by the SR system. We also employ a genetic algorithm to guide the direction of example generation, which accelerates convergence. Experiments demonstrate that our approach can manipulate state-of-the-art SR systems with a high attack success rate of 97.5% and small distortions. Extensive evaluations on the benchmark datasets VoxCeleb1, VoxCeleb2, and TIMIT further verify the effectiveness and stealthiness of our attack.
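
To illustrate the black-box setting described above, the following is a minimal sketch of NES-style gradient estimation that relies only on confidence scores returned by the target system. The `score_fn` oracle, step sizes, perturbation budget, and the L-infinity projection are illustrative assumptions, not the paper's exact procedure (which additionally uses a genetic algorithm to steer the search).

```python
import numpy as np

def nes_gradient(score_fn, x, sigma=1e-3, n_samples=50):
    """Estimate the gradient of a black-box score function via NES.

    score_fn : callable returning the target model's confidence score for a
               waveform (the only feedback the black-box attack observes).
    x        : current adversarial candidate (1-D numpy array of samples).
    sigma    : standard deviation of the Gaussian search directions.
    n_samples: number of antithetic sample pairs (queries = 2 * n_samples).
    """
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)        # random Gaussian search direction
        grad += score_fn(x + sigma * u) * u  # forward query
        grad -= score_fn(x - sigma * u) * u  # antithetic query
    return grad / (2 * n_samples * sigma)

def nes_attack(score_fn, x_orig, epsilon=2e-3, lr=1e-3, steps=200):
    """Iteratively perturb x_orig to raise the target speaker's score while
    keeping the perturbation inside an L-infinity ball of radius epsilon."""
    x_adv = x_orig.copy()
    for _ in range(steps):
        g = nes_gradient(score_fn, x_adv)
        x_adv = x_adv + lr * np.sign(g)      # gradient ascent on the score
        x_adv = np.clip(x_adv, x_orig - epsilon, x_orig + epsilon)
    return x_adv
```

In practice, the score oracle would wrap queries to the deployed SR system, and the hyperparameters above would be tuned to trade off query count against distortion.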
