Abstract

Automatic speaker verification (ASV) technology is recently finding its way to end-user applications for secure access to personal data, smart services or physical facilities. Similar to other biometric technologies, speaker verification is vulnerable to spoofing attacks where an attacker masquerades as a particular target speaker via impersonation, replay, text-to-speech (TTS) or voice conversion (VC) techniques to gain illegitimate access to the system. We focus on TTS and VC that represent the most flexible, high-end spoofing attacks. Most of the prior studies on synthesized or converted speech detection report their findings using high-quality clean recordings. Meanwhile, the performance of spoofing detectors in the presence of additive noise, an important consideration in practical ASV implementations, remains largely unknown. To this end, our study provides a comparative analysis of existing state-of-the-art, off-the-shelf synthetic speech detectors under additive noise contamination with a special focus on front-end processing that has been found critical. Our comparison includes eight acoustic feature sets, five related to spectral magnitude and three to spectral phase information. All the methods contain a number of internal control parameters. Except for feature post-processing steps (deltas and cepstral mean normalization) that we optimized for each method, we fix the internal control parameters to their default values based on literature, and compare all the variants using the exact same dimensionality and back-end system. In addition to the eight feature sets, we consider two alternative classifier back-ends: Gaussian mixture model (GMM) and i-vector, the latter with both cosine scoring and probabilistic linear discriminant analysis (PLDA) scoring. Our extensive analysis on the recent ASVspoof 2015 challenge provides new insights to the robustness of the spoofing detectors. Firstly, unlike in most other speech processing tasks, all the compared spoofing detectors break down even at relatively high signal-to-noise ratios (SNRs) and fail to generalize to noisy conditions even if performing excellently on clean data. This indicates both difficulty of the task, as well as potential to over-fit the methods easily. Secondly, speech enhancement pre-processing is not found helpful. Thirdly, GMM back-end generally outperforms the more involved i-vector back-end. Fourthly, concerning the compared features, the Mel-frequency cepstral coefficient (MFCC) and subband spectral centroid magnitude coefficient (SCMC) features perform the best on average though the winner method depends on SNR and noise type. Finally, a study with two score fusion strategies shows that combining different feature based systems improves recognition accuracy for known and unknown attacks in both clean and noisy conditions. In particular, simple score averaging fusion, as opposed to weighted fusion with logistic loss weight optimization, was found to work better, on average. For clean speech, it provides 88% and 28% relative improvements over the best standalone features for known and unknown spoofing techniques, respectively. If we consider the best score fusion of just two features, then RPS serves as a complementary agent to one of the magnitude features. To sum up, our study reveals a significant gap between the performance of state-of-the-art spoofing detectors between clean and noisy conditions.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.