Abstract

Faced with the threat of identity leakage when publishing voice data, users are caught in a privacy-utility dilemma as they enjoy the convenience of voice services. Existing machine-centric studies de-identify users' voices through direct modification or text-based re-synthesis, but these approaches cause inconsistent audibility for human participants in emerging online communication scenarios such as virtual meetings. In this paper, we propose a human-centric voice de-identification system, VoiceCloak, which uses adversarial examples to balance the privacy and utility of voice services. Instead of the typical additive examples that induce perceivable distortions, we design a novel convolutional adversarial example that modulates perturbations into real-world room impulse responses. As a result, VoiceCloak protects user identity from exposure by Automatic Speaker Identification (ASI) while preserving the perceptual quality of the voice for non-intrusive de-identification. Moreover, VoiceCloak learns a compact speaker distribution through a conditional variational auto-encoder to synthesize diverse targets on demand. Guided by these pseudo targets, VoiceCloak constructs adversarial examples in an input-specific manner, enabling any-to-any identity transformation for robust de-identification. Experimental results show that VoiceCloak achieves de-identification success rates of over 92% and 84% against mainstream ASIs and commercial systems, respectively, with excellent voiceprint consistency, speech integrity, and audio quality.
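
To make the convolutional-perturbation idea concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the perturbation is shaped like a room impulse response and convolved with the speech, then optimized so an ASI embedding moves toward a pseudo-target embedding. `SpeakerEncoderStub` and `target_emb` are illustrative stand-ins (in the actual system these would be a real speaker-embedding network and a sample from the conditional variational auto-encoder), and all hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoderStub(nn.Module):
    """Placeholder for a real ASI speaker-embedding network (assumption)."""
    def __init__(self, emb_dim=192):
        super().__init__()
        self.conv = nn.Conv1d(1, emb_dim, kernel_size=400, stride=160)

    def forward(self, wav):                 # wav: (batch, samples)
        feats = self.conv(wav.unsqueeze(1)) # (batch, emb_dim, frames)
        return feats.mean(dim=-1)           # (batch, emb_dim)

def convolve_perturbation(wav, rir):
    """Apply an RIR-shaped adversarial filter by convolution,
    rather than adding a noise-like perturbation to the waveform."""
    kernel = rir.flip(0).view(1, 1, -1)     # flip: true convolution, not correlation
    out = F.conv1d(wav.unsqueeze(1), kernel, padding=rir.numel() - 1)
    return out.squeeze(1)[:, : wav.shape[-1]]  # truncate to original length

encoder = SpeakerEncoderStub().eval()
speech = torch.randn(1, 16000)              # 1 s of 16 kHz audio (dummy input)
target_emb = torch.randn(1, 192)            # stand-in for a CVAE pseudo target

# Start from an identity-like impulse response (unit direct path),
# then optimize the filter toward the pseudo-target speaker embedding.
rir = torch.zeros(512)
rir[0] = 1.0
rir = rir.requires_grad_(True)
opt = torch.optim.Adam([rir], lr=1e-2)

for _ in range(100):
    adv = convolve_perturbation(speech, rir)
    loss = 1 - F.cosine_similarity(encoder(adv), target_emb).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the perturbation acts as a filter rather than additive noise, the de-identified speech resembles the same utterance played in a different room, which is why this style of perturbation tends to remain unobtrusive to human listeners.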
