Abstract

Conventional speaker verification systems become fragile or ineffective when attacked with spoofed speech. Many anti-spoofing countermeasures have been studied for automatic speaker verification. In current spoofing-detection research, the choice of salient features is known to matter more than the choice of classifier. The effectiveness of the constant-Q transform (CQT) for anti-spoofing feature analysis has been demonstrated in much of the literature on automatic speaker verification. Building on four CQT-based information sub-features, namely octave-band principal information (OPI), full-band principal information (FPI), short-term spectral statistics information (STSSI) and magnitude-phase energy information (MPEI), this paper proposes three concatenated features by investigating their information complementarity. The first is constant-Q statistics-plus-principal information coefficients (CQSPIC), which combines OPI, FPI and STSSI; the second is constant-Q energy-plus-principal information coefficients (CQEPIC), which combines OPI, FPI and MPEI; and the third is constant-Q energy-statistics-principal information coefficients (CESPIC), which combines OPI, FPI, MPEI and STSSI. We set up deep neural network (DNN) classifiers to evaluate the proposed features. Experiments show that the proposed features outperform some commonly used features, and that the proposed systems give better or comparable performance compared with the state of the art on the ASVspoof 2019 logical access and physical access corpora.

Highlights

  • Over the past decades, speaker verification technology has gradually matured in commercial products [1], [2]

  • Our experiments show that octave-band short-term spectral statistics information (STSSI) is not competitive with full-band STSSI for spoofing detection

  • STSSI captures statistical information and is substantially complementary to both octave-band principal information (OPI) and full-band principal information (FPI)


Summary

INTRODUCTION

Speaker verification technology has gradually matured in commercial products [1], [2]. CQT is used to convert speech from the time domain to the frequency domain; Power Spectrum is used to calculate the octave power spectrum; Log is used to obtain the octave power spectrum on a logarithmic scale; Uniform Resampling is used to convert the logarithmic octave power spectrum into a log power spectrum on a linear frequency scale; and DCT is used to extract the principal information by decorrelating the intermediate coefficients in order to reduce the dimension. It is believed that the mean and variance of the spectral amplitude, taken either over a long-term period for a given part of the spectrum or over a range of frequencies at a single time frame, provide good traits for distinguishing the two kinds of speech signals. Unlike CQCC, which resamples the log power spectrum onto a linear scale, FPI applies the DCT directly to the log power spectrum in the CQT domain, so that human auditory properties are preserved and reflected in the feature. On the basis of EM(l) and EP(l), and denoting by MPEI(l) the MPEI at frame l, we can obtain
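As a rough illustration of two ideas from this section, applying the DCT directly to the log power spectrum in the CQT domain and taking the mean and variance of the spectral amplitude over frequency at each frame, the following minimal sketch may help. It is not the paper's implementation: the function names, the number of retained coefficients, the `eps` floor, and the input `cqt_mag` (assumed to be a precomputed frames-by-bins CQT magnitude matrix) are all hypothetical choices, and the final side-by-side stacking only hints at how sub-features could be concatenated.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Orthonormal DCT-II along the last axis, keeping the first n_coeffs."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]          # output coefficient index
    m = np.arange(n)[None, :]                 # input sample index
    basis = np.cos(np.pi * (m + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)                  # scale the DC row for orthonormality
    return x @ basis.T

def principal_and_statistics(cqt_mag, n_coeffs=20, eps=1e-10):
    """Illustrative frame-level features from CQT magnitudes.

    cqt_mag: array of shape (frames, bins), non-negative (assumed input).
    Returns an array of shape (frames, n_coeffs + 2).
    """
    # Log power spectrum, kept on the CQT's own frequency axis (no resampling).
    log_power = np.log(cqt_mag ** 2 + eps)
    # DCT applied directly in the CQT domain to decorrelate and reduce dimension.
    principal = dct_ii(log_power, n_coeffs)
    # Mean and variance of the spectral amplitude over frequency, per frame.
    mean = cqt_mag.mean(axis=1, keepdims=True)
    var = cqt_mag.var(axis=1, keepdims=True)
    # Simple concatenation of the two kinds of information.
    return np.hstack([principal, mean, var])
```

For example, calling `principal_and_statistics` on a matrix of 5 frames by 84 bins yields a 5-by-22 array: 20 DCT coefficients followed by one mean and one variance per frame.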

CONCATENATION
CLASSIFIER IN SPOOFING DETECTION
PERFORMANCE EVALUATION
CONCLUSION
