Abstract

Liveness detection has emerged as an important research issue for many biometrics, such as face, iris, and hand geometry, and significant research efforts have been reported in the literature. However, less emphasis has been given to liveness detection for voice biometrics, i.e., Automatic Speaker Verification (ASV). Voice Liveness Detection (VLD) is a potential technique for detecting spoofing attacks on an ASV system. In the VLD framework, the presence of pop noise in the speech signal of a live speaker provides a discriminative acoustic cue for distinguishing genuine from spoofed speech. Pop noise arises as a burst at the lips and is captured by the ASV system when the speaker is close enough to the microphone, thereby indicating the liveness of the speaker and providing the basis of VLD. In this paper, we present a Constant-Q Transform (CQT)-based approach and compare it with the traditional Short-Time Fourier Transform (STFT)-based algorithm (baseline). Consistent with Heisenberg's uncertainty principle in the signal processing framework, the CQT has variable spectro-temporal resolution, in particular, better frequency resolution in the low-frequency region and better temporal resolution in the high-frequency region, which can be effectively exploited to identify the low-frequency characteristics of pop noise. We also compare the proposed algorithm with cepstral features, namely, Linear Frequency Cepstral Coefficients (LFCC) and Constant-Q Cepstral Coefficients (CQCC). The experiments are performed on the recently released POp noise COrpus (POCO) dataset with various statistical, discriminative, and deep learning-based classifiers, namely, Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Convolutional Neural Network (CNN), Light-CNN (LCNN), and Residual Network (ResNet). A significant improvement in performance, in particular, an absolute improvement of 14.23% and 10.95% in classification accuracy on the development and evaluation sets, respectively, is obtained for the proposed CQT-based algorithm with the SVM classifier over the STFT-SVM (baseline) system. A similar trend of performance improvement is observed for the GMM, CNN, LCNN, and ResNet classifiers with the proposed CQT-based algorithm versus the traditional STFT-based algorithm. The analysis is further extended by simulating the replay mechanism (in the standard framework of the ASVSpoof-2019 PA challenge dataset) on a subset of the POCO dataset in order to observe the effect of room acoustics on the performance of the VLD system. With a moderate simulated replay mechanism embedded in the POCO dataset, we obtained a classification accuracy of 97.82% on the evaluation set.
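
The intuition behind the CQT-based front end can be illustrated with a short sketch. The snippet below is not the paper's exact pipeline; it assumes the librosa library, a hypothetical file path, and illustrative analysis parameters, and simply contrasts the CQT's geometrically spaced bins (finer frequency resolution at low frequencies, where pop noise concentrates) with the STFT's fixed resolution.

```python
# Illustrative sketch only (assumed parameters, not the paper's configuration):
# extract a CQT spectrogram and sum energy in its lowest octaves as a rough
# per-frame cue for pop-noise bursts; an STFT is computed for comparison.
import numpy as np
import librosa


def low_frequency_pop_cue(wav_path, sr=16000, fmin=32.7,
                          bins_per_octave=24, n_bins=168):
    """Return per-frame low-frequency CQT energy (candidate pop-noise cue)."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Constant-Q transform: bin spacing is geometric, so the low-frequency
    # region is sampled with many narrow bins (good frequency resolution).
    C = np.abs(librosa.cqt(y, sr=sr, fmin=fmin,
                           n_bins=n_bins, bins_per_octave=bins_per_octave))

    # STFT baseline: a fixed window gives uniform time-frequency resolution.
    S = np.abs(librosa.stft(y, n_fft=512, hop_length=128))

    # Energy in the lowest two CQT octaves (roughly 33-130 Hz with these
    # assumed settings); frames with sudden bursts here are pop-noise candidates.
    low_energy = C[:2 * bins_per_octave, :].sum(axis=0)
    return low_energy, C, S
```

In a full VLD system, such CQT-based features (or cepstral variants like CQCC) would be fed to a classifier such as GMM, SVM, CNN, LCNN, or ResNet, as studied in the paper.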
