Abstract

The natural-sounding speech produced by recent text-to-speech and voice conversion techniques poses a serious threat to automatic speaker verification systems. Most existing spoofing countermeasures perform well when the nature of the attacks is known during training, but their performance degrades in realistic applications, where they face unseen types of attacks. To address this concern, we propose a novel spoof detection method, Dual Path Res2Net (DP-Res2Net), that improves robustness to unknown attacks. For feature engineering, we use time-domain features rather than the commonly used frequency-domain ones, feeding 80,000 sampling points directly into the network. The input is then processed by a shallow feature learning module, an interactive feature learning module, a deep feature learning module, and a discriminator network. The dual-path residual-like block exploits the dependence between successive audio segments with a large receptive field. Furthermore, the proposed DP-Res2Net significantly improves the model's generalizability to unseen spoofing attacks. We evaluate the proposed method on the publicly available ASVspoof 2019 logical access evaluation set, and the results demonstrate that it outperforms state-of-the-art audio spoof detection models.
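As a rough illustration of the fixed-size time-domain input, the sketch below brings a waveform to exactly 80,000 samples before it would be passed to a network. The abstract states only the input length; the helper name `fix_length`, the crop-from-start rule, and the tile-repeat padding are assumptions made for this example and may differ from the authors' preprocessing.

```python
from typing import List, Sequence

TARGET_LEN = 80_000  # number of sampling points stated in the abstract


def fix_length(waveform: Sequence[float], target_len: int = TARGET_LEN) -> List[float]:
    """Return exactly target_len samples.

    Assumed policy (not taken from the paper): long clips are cropped
    from the start, short clips are repeated end to end and then cropped.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if len(waveform) == 0:
        raise ValueError("waveform must contain at least one sample")
    if len(waveform) >= target_len:
        return list(waveform[:target_len])
    repeats = -(-target_len // len(waveform))  # ceiling division
    return (list(waveform) * repeats)[:target_len]
```

Assuming a 16 kHz sampling rate, 80,000 samples corresponds to 5 seconds of audio, so a policy like this would shorten or extend each utterance to that duration.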
