Abstract

In this study, we explore the use of Convolutional Neural Networks (CNNs) for replay spoof detection in Automatic Speaker Verification (ASV) systems. Amplitude and Frequency Modulation (AM-FM) feature sets obtained from the Hilbert transform (HT) and the Energy Separation Algorithm (ESA) are used as the front end. We examine the effect of replacing the max-pooling and fully connected (FC) layers of the CNN with convolutional layers. The results are compared with a Gaussian Mixture Model (GMM) classifier. Furthermore, to exploit possible complementary information in the GMM and CNN classifiers, we explore classifier-level fusion. In addition, we compare our results with the Constant-Q Cepstral Coefficients (CQCC) and Mel Frequency Cepstral Coefficients (MFCC) feature sets. The architecture in which max-pooling is replaced by a convolutional layer while the FC layers are retained performed relatively better than the other CNNs on most of the AM-FM feature sets. The ESA-based AM features, i.e., Instantaneous Amplitude Cosine Coefficients (ESA-IACC), performed better because the AM features fluctuate less than the FM features during model training. The lowest Equal Error Rate (EER) is obtained with classifier-level fusion on the ESA-IACC feature set: 2.54 % on the development set and 6.04 % on the evaluation set of the ASVspoof 2017 Challenge database.
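To make the ESA front end more concrete, the following is a minimal sketch of how an energy separation algorithm can split a narrowband signal into instantaneous amplitude and frequency using the Teager energy operator. The abstract does not state which ESA variant is used, so the DESA-2 form is assumed here; the function names `teager_energy` and `desa2` and the `eps` guard are illustrative, and the subsequent steps that would turn the amplitude envelope into cosine coefficients (ESA-IACC) are not shown.

```python
import math


def teager_energy(x):
    """Discrete Teager energy operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1).

    Returns one value per interior sample n = 1 .. len(x) - 2.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def desa2(x, eps=1e-12):
    """Estimate instantaneous amplitude and frequency (rad/sample) with DESA-2.

    Assumed variant (not confirmed by the source). Estimates are returned for
    samples n = 2 .. len(x) - 3, so each output list has len(x) - 4 entries.
    """
    # Symmetric difference y(n) = x(n+1) - x(n-1), defined for n = 1 .. N-2.
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager_energy(x)  # psi_x[i] belongs to sample n = i + 1
    psi_y = teager_energy(y)  # psi_y[k] belongs to sample n = k + 2

    amplitude, frequency = [], []
    for k, py in enumerate(psi_y):
        px = psi_x[k + 1]  # align both energies on sample n = k + 2
        if px <= eps or py <= eps:
            # Degenerate energy (e.g. silence): no reliable estimate.
            amplitude.append(0.0)
            frequency.append(0.0)
            continue
        # Clamp to the valid acos domain to absorb rounding error.
        ratio = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        frequency.append(0.5 * math.acos(ratio))
        amplitude.append(2.0 * px / math.sqrt(py))
    return amplitude, frequency


# Usage: a pure cosine with amplitude 0.8 and frequency 0.3 rad/sample.
signal = [0.8 * math.cos(0.3 * n + 0.1) for n in range(200)]
amp, freq = desa2(signal)
```

For a stationary sinusoid the DESA-2 estimates are exact up to rounding, which makes such a signal a convenient sanity check. On real speech the signal would normally be band-pass filtered first, since the operator assumes a single narrowband AM-FM component.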
