Abstract

Technological progress and the proliferation of sophisticated software have made it easier than ever to spoof a person's voice, and audio in general. Like other biometrics, speaker verification is vulnerable to spoofing attacks, and detecting these attacks from the artifacts present in the recordings is a major challenge. The current trend in spoofing detection is to use deep learning architectures that perform end-to-end detection, with a pooling layer that aggregates frame-level information into utterance-level embeddings. Normally, only first-order statistics, or first- and second-order statistics, are pooled across the temporal dimension. In this paper, we investigate the influence of higher-order statistics, namely third- and fourth-order moments, on spoofing detection performance. A Time Delay Neural Network (TDNN) architecture operating on linear frequency cepstral coefficients is used to carry out spoofing detection experiments on the ASVspoof2019 challenge logical access and physical access corpora. Experimental results, in terms of equal error rate (EER) and minimum tandem detection cost function (min-tDCF), show that including higher-order statistics helps improve the performance of spoofing detection systems.
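To make the pooling step concrete, the following is a minimal sketch of statistics pooling extended with third- and fourth-order moments. It is an illustration under stated assumptions rather than the paper's implementation: the function name `higher_order_stats_pooling`, the use of standardized skewness and kurtosis as the third- and fourth-order statistics, the `eps` stabilizer, and the concatenation order are all hypothetical choices made here.

```python
import math


def higher_order_stats_pooling(frames, eps=1e-8):
    """Pool T frame-level vectors (each of size D) into one utterance-level vector.

    Assumed output layout (hypothetical): [mean | std | skewness | kurtosis],
    giving a vector of length 4 * D.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means, stds, skews, kurts = [], [], [], []

    for d in range(dim):
        # Collect feature dimension d across the temporal axis.
        column = [frame[d] for frame in frames]

        # First-order statistic: mean over time.
        mean = sum(column) / num_frames
        centered = [x - mean for x in column]

        # Second-order statistic: (population) standard deviation.
        # eps keeps the division below finite for constant features.
        variance = sum(c * c for c in centered) / num_frames
        std = math.sqrt(variance + eps)

        # Third- and fourth-order statistics: standardized moments.
        skew = sum((c / std) ** 3 for c in centered) / num_frames
        kurt = sum((c / std) ** 4 for c in centered) / num_frames

        means.append(mean)
        stds.append(std)
        skews.append(skew)
        kurts.append(kurt)

    return means + stds + skews + kurts


# Toy input: 3 frames, 2 feature dimensions.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
pooled = higher_order_stats_pooling(frames)
```

In a conventional mean-and-standard-deviation pooling layer only the first two groups would be returned; the sketch simply appends the two extra groups, so the utterance-level vector grows from 2D to 4D values before being passed to the layers that follow.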
