Abstract

Adversarial examples crafted to deceive Automatic Speaker Verification (ASV) systems have attracted considerable attention in studies of ASV vulnerability. However, real-world ASV systems usually operate together with spoofing countermeasures (CM) that exclude fake voices generated by text-to-speech (TTS) or voice conversion (VC). Deploying a CM reduces the ability of adversarial samples to deceive ASV. Although additional perturbations against the CM can be generated and added to adversarial examples crafted against ASV, yielding new examples that attack both ASV and CM, these additional perturbations in turn weaken the examples' adversarial effectiveness against ASV. In this paper, a novel joint approach is proposed that generates adversarial examples by attacking ASV and CM simultaneously. For any voice from TTS, VC, or a real speaker, the crafted adversarial perturbations change its CM label to bonafide and its speaker ID to a chosen target speaker. In our approach, a differentiable front-end replaces the conventional hand-crafted time-frequency feature extractor, so perturbations can be estimated by back-propagating the gradients of the joint ASV-CM objective to the waveform variables. The proposed method achieves a 99.3% success rate on white-box logical-access attacks that deceive ASV and CM simultaneously, outperforming baselines of 65.3% and 36.7%. Transferability to black-box and physical settings is also validated.
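To illustrate the joint optimization described above, the following is a minimal PGD-style sketch, assuming PyTorch and two hypothetical wrappers not defined in the paper: `asv_model` (waveform to speaker embedding) and `cm_model` (waveform to bonafide logit), both containing differentiable front-ends so that gradients reach the raw waveform. The loss weight, step size, and perturbation budget are illustrative placeholders, not the authors' settings.

```python
import torch

def craft_joint_adversarial(waveform, target_embedding, asv_model, cm_model,
                            epsilon=0.002, alpha=0.0004, steps=50, lam=1.0):
    """Sketch of the joint attack: push the waveform toward the target
    speaker on ASV while pushing the CM score toward 'bonafide'."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    cos = torch.nn.CosineSimilarity(dim=-1)
    for _ in range(steps):
        adv = waveform + delta
        # ASV term: cosine similarity to the target speaker's embedding
        # (assumed verification score; models here are hypothetical).
        asv_score = cos(asv_model(adv), target_embedding).mean()
        # CM term: logit of the bonafide class.
        cm_score = cm_model(adv).squeeze(-1).mean()
        loss = asv_score + lam * cm_score        # joint ASV-CM objective
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient-sign ascent step
            delta.clamp_(-epsilon, epsilon)      # keep perturbation imperceptible
        delta.grad.zero_()
    return (waveform + delta).detach()
```

Because the objective is maximized directly over the waveform variable `delta`, a single perturbation serves both goals at once, rather than stacking a separate anti-CM perturbation on top of an anti-ASV one.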
