Abstract

The domain of Automatic Speaker Verification (ASV) is flourishing, with growing developments in feature engineering and artificial intelligence. In spite of this, ASV systems remain vulnerable to spoofing attacks in the form of synthetic or replayed speech. Synthetic speech is difficult to detect because recent advances in Voice Conversion and Text-to-Speech systems produce natural speech that is hard to distinguish from genuine speech. Preventing such attacks requires robust spoof detection systems. To this end, we propose estimating Glottal Flow Parameters (GFP) from genuine speech and synthetic spoof samples. The GFP are further parameterized using time, frequency and Liljencrants–Fant (LF) models. Along with the GFP features, the Linear Prediction Cepstrum Coefficients (LFCC) and statistical parameters are computed. The GFP features are investigated to demonstrate their usefulness in distinguishing spoofed from genuine speech. The framework is tested on the ASVspoof 2019 corpus and evaluated against the baseline models. The proposed spoof detection framework achieves an Equal Error Rate (EER) of 2.39% and a tandem Detection Cost Function (t-DCF) of 0.0562, which is better than the state-of-the-art technique.

Highlights

  • The speaker verification system acknowledges the true identity of a known speaker while dismissing the unknown speaker’s voice [1]

  • The process of binary classification leads to two error types: the False Acceptance Ratio (FAR) and the False Rejection Ratio (FRR)

  • The Glottal Flow Parameters (GFP), when used in conjunction with Vocal Tract (VT) parameters, improve the Equal Error Rate (EER) and tandem Detection Cost Function (t-DCF) compared to the baseline technique
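
The two error types and the EER mentioned in the highlights can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `far_frr` and `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the toy scores are all assumptions made for illustration. The EER is approximated as the operating point where FAR and FRR are closest.

```python
import numpy as np

def far_frr(genuine_scores, spoof_scores, threshold):
    """Return (FAR, FRR), assuming scores >= threshold are accepted as genuine."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    far = float(np.mean(spoof >= threshold))   # spoofed trials wrongly accepted
    frr = float(np.mean(genuine < threshold))  # genuine trials wrongly rejected
    return far, frr

def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = np.unique(np.concatenate([genuine_scores, spoof_scores]))
    rates = [far_frr(genuine_scores, spoof_scores, t) for t in thresholds]
    far, frr = np.array(rates).T
    idx = int(np.argmin(np.abs(far - frr)))    # point where FAR and FRR are closest
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])

# Hypothetical toy scores, not taken from the paper.
genuine = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

Raising the threshold lowers FAR and raises FRR, so a single EER figure summarizes the trade-off at the point where the two errors balance. The t-DCF additionally weights the errors of the spoof detector together with those of the ASV system and is not reproduced here.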


Summary

Introduction

The speaker verification system acknowledges the true identity of a known speaker while dismissing the unknown speaker’s voice [1]. These systems are bound to be exposed to infiltrators through spoofing attacks. An intrusion in the form of synthetically generated speech results in a spoofing attack on the ASV system. Such an environment is termed a Logical Access (LA) scenario, while the one with replayed speech is a Physical Access (PA) scenario [2]. These attacks are a result of continuous efforts by researchers in the fields of Voice Conversion (VC) and Text-to-Speech (TTS) [3], whose aim is to generate clean, human-like speech with little to no variation in the synthetic speech. Most of the research targets a specific type of attack [5], [6], while a few works consider all types of attack, making them universal detectors [7], [8].

