Abstract

Spoof detection is essential for improving the performance of current Automatic Speaker Verification (ASV) systems. Strengthening both the frontend and the backend makes it possible to build robust ASV systems. First, this paper compares the performance of static and static–dynamic Constant Q Cepstral Coefficients (CQCC) frontend features, using a Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend. Second, it presents a comparative analysis of ASV systems built with three deep learning models at the backend (LSTM with Time Distributed Wrappers, LSTM, and Convolutional Neural Network (CNN)), all using static–dynamic CQCC features at the frontend. Third, it discusses the implementation of two spoof detection systems for ASV that use the same static–dynamic CQCC features at the frontend and different combinations of deep learning models at the backend. The first is a voting-protocol-based two-level spoof detection system that uses CNN and LSTM models at the first level and an LSTM with Time Distributed Wrappers model at the second level. The second is a two-level spoof detection system with a user identification and verification protocol, which uses an LSTM model for user identification at the first level and an LSTM with Time Distributed Wrappers model for verification at the second level. To implement the proposed work, a variation of the ASVspoof 2019 dataset is used so that all types of spoofing attacks, namely Speech Synthesis (SS), Voice Conversion (VC) and replay, appear in a single dataset. The results show that, at the frontend, static–dynamic CQCC features outperform static CQCC features and that, at the backend, a hybrid combination of deep learning models increases the accuracy of spoof detection systems.

Highlights

  • Building a robust spoof detection system for Automatic Speaker Verification (ASV) is an essential task, as attention to and demand for voice-protected authentication systems are increasing among users of smart devices

  • The Equal Error Rate (EER) is computed for both arrangements to compare the performance of the feature sets

  • For ASV systems, EER is the evaluation metric used; it is applied to the classification results of the model for the spoof detection task [10, 15, 26]
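
As a rough illustration of the EER mentioned in the highlights, the sketch below finds the operating point where the false acceptance rate (spoofed trials accepted) and the false rejection rate (bona fide trials rejected) are closest. The function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the example scores are assumptions made for illustration; they are not taken from the paper or from the ASVspoof evaluation tools.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER from two lists of detector scores.

    Assumption: a higher score means "more likely bona fide".
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, ascending.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # A trial is accepted when its score is >= the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])  # bona fide rejected
    far = np.array([(spoof >= t).mean() for t in thresholds])    # spoof accepted

    # EER is taken where the two error rates cross (or come closest).
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    eer, threshold = equal_error_rate([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.6, 0.4])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

Because the thresholds are restricted to observed scores, the result is a discrete approximation; evaluation toolkits typically interpolate between neighbouring operating points.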

Summary

Introduction

Building a robust spoof detection system for Automatic Speaker Verification (ASV) is an essential task, as attention to and demand for voice-protected authentication systems are increasing among users of smart devices. This factor can be exploited in the frontend development of speech-driven devices [18, 19] by using the All Pole Group Delay Function (APGDF), the Modified Group Delay Function (MODGDF), etc. Both static and dynamic coefficients of speech features carry contextual and speaker-specific information. These coefficients are passed to the backend spoof detection model. The work proposed in this paper exploits a hybrid of static and dynamic CQCC features for developing the frontend. It presents a performance comparison of static and static–dynamic CQCC features, using a Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend. The rest of the paper is organized as follows: the second section discusses related work; the third section describes the proposed method; the fourth section presents the experimental setup and results; the fifth section analyzes the performance of the proposed models and systems; the sixth section compares the proposed systems with existing systems; and the seventh section concludes the paper and sheds some light on future directions.
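
To make the static versus static–dynamic distinction concrete, the sketch below appends dynamic coefficients to a matrix of static coefficients. It assumes the dynamic part consists of first-order (delta) and second-order (delta-delta) regression coefficients, which is a common convention but is not confirmed by the text above. The CQCC extraction itself is not shown: `static` stands for any precomputed array of shape (frames, coefficients), and the names `delta`, `static_dynamic` and `width` are hypothetical.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based delta coefficients along the time axis.

    features: array of shape (frames, coefficients).
    width: frames on each side used in the regression (assumed value).
    """
    frames = features.shape[0]
    # Repeat the edge frames so that every frame has `width` neighbours.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + frames]   # frame t + n
        behind = padded[width - n : width - n + frames]  # frame t - n
        out += n * (ahead - behind)
    return out / denom


def static_dynamic(static, width=2):
    """Stack static, delta and delta-delta coefficients frame by frame."""
    d1 = delta(static, width)
    d2 = delta(d1, width)
    return np.concatenate([static, d1, d2], axis=1)


if __name__ == "__main__":
    # Stand-in for a static CQCC matrix: 100 frames, 20 coefficients.
    static = np.random.default_rng(0).normal(size=(100, 20))
    print(static_dynamic(static).shape)
```

Under these assumptions, the static-only frontend would pass `static` to the backend, while the static–dynamic frontend would pass the output of `static_dynamic`, which has three times as many coefficients per frame.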
