Abstract

Countermeasures used to detect synthetic and voice-converted spoofed speech are usually based on excitation source or system features. However, the natural speech production mechanism also exhibits nonlinear source–filter (S–F) interaction. This interaction is an attribute of natural speech and is rarely present in synthetic or voice-converted speech. Therefore, we propose features based on the S–F interaction for the spoofed speech detection (SSD) task. To that end, we estimate the voice excitation source (i.e., the differenced glottal flow waveform, $\dot{g}(t)$) and model it using the well-known Liljencrants–Fant (LF) model to obtain the coarse structure, $g_{c}(t)$. The residue, $g_{r}(t)$, i.e., the difference between $\dot{g}(t)$ and $g_{c}(t)$, is known to capture the nonlinear S–F interaction. In the time domain, the $L^{2}$ norms of $g_{r}(t)$ in the closed, open, and return phases of the glottis are taken as features. In the frequency domain, the Mel representation of $g_{r}(t)$ contributed significantly to the SSD task. The proposed features are evaluated on the first ASVspoof 2015 challenge database using a Gaussian mixture model (GMM)-based classification system. On the evaluation set, for vocoder-based spoofs (i.e., S1–S9), the score-level fusion of the residual energy features, the Mel representation of the residual signal, and Mel frequency cepstral coefficients (MFCC) gave an equal error rate (EER) of 0.017%, much lower than the 0.319% obtained with MFCC alone. Furthermore, the residues of the spectrogram (as well as the Mel-warped spectrogram) of the estimated $\dot{g}(t)$ and $g_{c}(t)$ are also explored as features for the SSD task. The features are evaluated for robustness in the presence of additive white, babble, and car noise at various signal-to-noise ratio (SNR) levels on the ASVspoof 2015 database, and under channel mismatch on the Blizzard Challenge 2012 dataset.
In both cases, the proposed features gave a significantly lower EER than MFCC on the evaluation set.
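The time-domain residual energy features described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the way the glottal-phase index sets are supplied are assumptions (in the paper, the phase boundaries follow from the LF-model fit of each glottal cycle).

```python
import numpy as np

def residual_energy_features(g_dot, g_c, closed_idx, open_idx, return_idx):
    """L2 norms of the residue g_r(t) = g_dot(t) - g_c(t) per glottal phase.

    g_dot      : estimated differenced glottal flow waveform (one cycle)
    g_c        : LF-model fit (coarse structure) of the same cycle
    *_idx      : index arrays marking the closed, open, and return phases
                 (illustrative; real boundaries come from the LF-model fit)
    Returns a 3-element feature vector, one energy value per phase.
    """
    g_r = g_dot - g_c  # residue capturing the nonlinear S-F interaction
    return np.array([np.linalg.norm(g_r[idx])
                     for idx in (closed_idx, open_idx, return_idx)])
```

A full system would compute this vector for every glottal cycle, append the Mel representation of $g_{r}(t)$, and feed the features to the GMM-based classifier.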
