The need for Voice Liveness Detection (VLD) has emerged particularly for securing Automatic Speaker Verification (ASV) systems. Existing Spoofed Speech Detection (SSD) systems rely on attack-specific approaches to detect spoofed speech. However, to safeguard ASV systems against all kinds of spoofing attacks (known as well as unknown), it is important to determine whether speech is uttered live (genuine) or not. To that effect, we propose detecting pop noise using Morse wavelets for the VLD task. Pop noise is a discriminative acoustic cue that is present in live speech and absent or diminished in spoofed speech. It is captured by the microphone as sudden bursts of air from a live speaker's mouth, owing to the close proximity of the speaker to the microphone. To validate this hypothesis, we analyse pop noise energy w.r.t. distance and find that it decreases exponentially with distance. Furthermore, pop noise is reported to lie in the very low-frequency region. To capture it effectively, we propose to exploit the excellent frequency resolution of the Continuous Wavelet Transform (CWT) using Generalized Morse Wavelets (GMWs), a superfamily of analytic wavelets. We analyse the suitability of GMWs for pop noise detection in the VLD task using the POp noise COrpus (POCO), with the wavelet parameters fine-tuned for the VLD task. The performance of the VLD system is evaluated for various frequency subbands, and it is observed that the 1 to 50 Hz subband gives the best accuracy: 90.55% and 88.43% on the Dev and Eval sets, respectively. In addition, phoneme-based analysis shows that the performance of the VLD system depends on the type of phonemes in the utterances. In particular, phonemes such as plosives and fricatives show more distinct pop noise than other phonemes.
Furthermore, an extension of the POCO dataset is used for experiments in which simulated reverberation is added to the spoofed signals, assuming that the attacker (or the recording device) is positioned at various distances. This allows us to study the effect of the speaker-attacker distance. Consistent with the previous results, it is observed that for the reverberated case too, the optimal frequency subband for the VLD task is 1 to 50 Hz across all distances. The proposed feature set is also evaluated using three classifiers, namely, Convolutional Neural Network (CNN), Light CNN (LCNN), and Residual Neural Network (ResNet), on the POCO dataset as well as the reverberated POCO dataset. It is observed that the CNN gives the highest accuracy, 88.43%, on the Eval set of the POCO dataset. Finally, the proposed features are evaluated under two idealized scenarios: when the ASV system is strictly under attack, and when it is strictly not under attack. It is observed that the proposed Morse wavelet-based VLD system rejects 89% of the spoofed utterances and accepts 88.30% of the genuine utterances.
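
As an illustration of the low-frequency CWT feature described above, the following is a minimal sketch, not the authors' implementation. It assumes the standard frequency-domain form of the generalized Morse wavelet, Ψ(ω) = a ω^β exp(−ω^γ) for ω > 0, normalised so that its peak value is 2. The function names and the parameter values β = 20 and γ = 3 are illustrative assumptions; the tuned values used in this work are not stated here.

```python
import numpy as np

def morse_wavelet_freq(omega, beta=20.0, gamma=3.0):
    """Generalized Morse wavelet in the frequency domain (analytic: zero for omega <= 0).

    Normalised so the peak value, reached at omega = (beta/gamma)**(1/gamma), equals 2.
    beta and gamma are illustrative defaults, not the values tuned in the paper.
    """
    psi = np.zeros_like(omega, dtype=float)
    pos = omega > 0
    # log of the normalising constant 2 * (e * gamma / beta)**(beta / gamma)
    log_norm = np.log(2.0) + (beta / gamma) * (1.0 + np.log(gamma / beta))
    psi[pos] = np.exp(log_norm + beta * np.log(omega[pos]) - omega[pos] ** gamma)
    return psi

def morse_cwt(x, fs, freqs_hz, beta=20.0, gamma=3.0):
    """CWT of x computed via the FFT, one row per analysis frequency (in Hz)."""
    n = len(x)
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)  # rad/s
    spectrum = np.fft.fft(x)
    peak = (beta / gamma) ** (1.0 / gamma)  # peak frequency of the mother wavelet
    coeffs = np.empty((len(freqs_hz), n), dtype=complex)
    for i, f in enumerate(freqs_hz):
        scale = peak / (2.0 * np.pi * f)  # centres the scaled wavelet on f
        coeffs[i] = np.fft.ifft(spectrum * morse_wavelet_freq(scale * omega, beta, gamma))
    return coeffs

def low_band_energy(x, fs, f_lo=1.0, f_hi=50.0, n_freqs=25):
    """Time-averaged scalogram energy in the [f_lo, f_hi] subband (1 to 50 Hz by default)."""
    freqs = np.geomspace(f_lo, f_hi, n_freqs)
    coeffs = morse_cwt(x, fs, freqs)
    return float(np.sum(np.abs(coeffs) ** 2) / len(x))
```

Under these assumptions, a signal with strong content below 50 Hz (as expected for pop noise in live speech) yields a much larger `low_band_energy` than a signal whose energy lies at higher frequencies. In a full system, the scalogram restricted to this subband, rather than a single energy value, would be passed to the classifier.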