Abstract

With the rapid development of intelligent speech technologies, automatic speaker verification (ASV) has become one of the most natural and convenient biometric speaker recognition approaches. However, most state-of-the-art ASV systems are vulnerable to spoofing attacks such as speech synthesis, voice conversion, and replayed speech. Because genuine (true) and spoofed (fake) speech pairs have symmetric distributions, detecting spoofing attacks is challenging, and many recent research works have focused on ASV anti-spoofing solutions. This work investigates two new types of acoustic features to improve the performance of spoofing attack detection. The first type consists of two cepstral coefficient features and one LogSpec feature, all extracted from linear prediction (LP) residual signals. The second is a harmonic and noise subband ratio feature, which reflects how the interaction between the vocal tract and glottal airflow differs between genuine and spoofed speech. The significance of these new features is investigated in both the t-distributed stochastic neighbor embedding (t-SNE) space and the binary classification modeling space. Experiments on the ASVspoof 2019 database show that the proposed residual features achieve a relative equal error rate (EER) reduction of 7% to 51.7% on the development and evaluation sets over the best single-system baseline. Furthermore, a relative EER reduction of more than 31.2% on both the development and evaluation sets shows that the proposed features carry substantial information complementary to the source acoustic features.
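To make the notion of an LP residual concrete, the following is a minimal sketch, not the authors' implementation. It estimates LP coefficients for one frame with the autocorrelation method and inverse-filters the frame to obtain the residual. The function names, the LP order of 12, and the small diagonal loading term are assumptions chosen for illustration; the abstract does not state the settings used in the paper.

```python
import numpy as np


def lp_coefficients(frame, order):
    """Estimate LP coefficients a[1..order] with the autocorrelation method.

    The model predicts s[n] as sum_k a[k] * s[n - k].
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Small diagonal loading for numerical stability (an assumption,
    # not something taken from the paper).
    R = R + 1e-9 * r[0] * np.eye(order)
    return np.linalg.solve(R, r[1:])


def lp_residual(frame, order=12):
    """Return e[n] = s[n] - sum_k a[k] * s[n - k] for one frame.

    The default order is a hypothetical value for illustration.
    """
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    # Inverse filter A(z) = 1 - sum_k a[k] z^{-k}.
    inverse_filter = np.concatenate(([1.0], -a))
    return np.convolve(frame, inverse_filter)[: len(frame)]
```

In a pipeline of the kind the abstract describes, cepstral or log-spectral features would then be computed on the residual rather than on the original waveform; that step is omitted here.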

Highlights

  • Received: 8 November 2021. In recent years, the performance of automatic speaker verification (ASV) systems has improved significantly

  • Unlike previous ASVspoof challenges, the 2019 challenge is the first to focus on countermeasures for all three major attack types, namely those stemming from text-to-speech synthesis (TTS), voice conversion (VC), and replay spoofing attacks

  • Although the harmonic and noise subband ratio (HNSR) features perform much worse than other features with a Gaussian mixture model (GMM) classifier, we hope that they may provide complementary information to other acoustic features during system fusion. Because the HNSR, residual CQCC (RCQCC), and residual LFCC (RLFCC) features are extracted in entirely different ways, they may capture different aspects of the spoofing acoustic characteristics of speech synthesis and voice conversion

Summary

Introduction

The performance of automatic speaker verification (ASV) systems has improved significantly. However, many studies in recent years have found that state-of-the-art ASV systems fail to handle speech spoofing attacks, especially three major types: replay [1], speech synthesis [2,3], and voice conversion. Both speech synthesis and voice conversion techniques can produce high-quality spoofing signals to deceive ASV systems [2]. The most recent challenge, ASVspoof 2019, was the first to consider replay, speech synthesis, and voice conversion attacks within a single challenge [2]. Compared with typical acoustic features extracted directly from the speech, we speculate that cepstral features extracted from the residual signals can better discriminate between bonafide and spoofed speech, whether for synthetic or voice-converted attacks or for replayed attacks. To verify the complementarity between different features, a score-level fusion system is proposed.
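As an illustration of score-level fusion and of the EER metric used to judge it, here is a minimal sketch under stated assumptions. It takes per-trial scores from several systems, combines them with a weighted linear sum, and computes the EER and the relative EER reduction. The equal default weights, the convention that higher scores mean "more likely bonafide", and all function names are assumptions; this summary does not specify the fusion rule used in the paper.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER, assuming higher scores mean 'more likely bonafide'."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate((bonafide_scores, spoof_scores)))
    # False rejection: bonafide trial scored below the threshold.
    frr = np.array([np.mean(bonafide_scores < t) for t in thresholds])
    # False acceptance: spoofed trial scored at or above the threshold.
    far = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def fuse_scores(score_lists, weights=None):
    """Weighted linear fusion of scores with shape (systems, trials).

    Equal weights are a hypothetical default for illustration.
    """
    scores = np.asarray(score_lists, dtype=float)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.asarray(weights, dtype=float) @ scores


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent with respect to a baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

In practice the per-system scores are usually brought to a comparable scale before fusion, and the weights are tuned on a development set; both steps are left out of this sketch.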

Related Work
Classifiers
Features
Linear Prediction Residual Modeling and Analysis
Conventional Cepstral Features
CQCC Features
Residual Cepstral Coefficients
Harmonic and Noise Interaction Features
Dataset-ASVSpoof 2019
Evaluation Metrics
Detection Results of Single Systems
System Fusion
Conclusions