Abstract

With advances in speech synthesis technology, audio spoof detection has become vital for the security of automatic speaker verification systems. Many effective solutions have been offered for clean speech data. However, as in many other speech-related tasks, additive noise has a detrimental impact on detection performance. Noise masking is one of the methods proposed to increase robustness against additive noise. The purpose of a noise mask is to identify the time–frequency regions dominated by the noise signal. In this work, a differential convolutional neural network is used to create noise masks. Differential convolution considers directional changes of the activations and generates new feature maps. Compared to a traditional convolutional network, a finer noise mask can be created with this method. Once the differential network for the noise masks is trained, its outputs are given to the spoof detection systems. Linear filterbank magnitudes are used as acoustic features for both noise mask estimation and spoof detection. The spoof detection systems therefore have 2-channel inputs, i.e., the linear filterbank magnitudes and their corresponding mask. Probabilistic linear discriminant analysis (PLDA) with x-vectors, emphasized channel attention, propagation and aggregation time delay neural network (ECAPA–TDNN), and light convolutional neural network (LCNN) followed by long short-term memory (LSTM) layers are used as classifiers. Three noise types are used in both the training and test stages, and two further noise types are used solely in the test stage, to simulate seen and unseen conditions, respectively. Experiments conducted on the noisy version of the ASVspoof 2015 challenge dataset show that the LCNN–LSTM network with noise masks achieves superior performance compared to other robust systems and can compete with the state of the art. Averaged over the known noise types, an equal error rate (EER) of 2.67% was observed. For the unknown noise types, an average EER of 3.10% was achieved. For the original (clean) ASVspoof 2015 data, the EER was 0.83%. Additionally, an EER of 2.6% was observed for the logical access condition of the ASVspoof 2019 data.
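
To make the idea of a noise mask and the 2-channel input concrete, the following is a minimal sketch and not the method of the paper. It assumes a binary "noise-dominated" target computed from separately available clean and noise signals, and it uses a plain STFT magnitude as a stand-in for the linear filterbank magnitudes. The function names, frame settings, and the synthetic signals are all hypothetical; the differential convolutional network that estimates the mask from noisy speech alone is not reproduced here.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Frame the signal, apply a Hann window, return |STFT| as (frames, bins).

    Stand-in for linear filterbank magnitudes (assumption, not the paper's front end).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def noise_dominance_mask(clean_mag, noise_mag):
    """Binary mask: 1 where the noise magnitude exceeds the clean-speech magnitude."""
    return (noise_mag > clean_mag).astype(np.float32)

def two_channel_input(noisy_mag, mask):
    """Stack features and mask into a (2, frames, bins) array for a classifier."""
    return np.stack([noisy_mag.astype(np.float32), mask], axis=0)

# Synthetic example: a 440 Hz tone stands in for speech, white noise for the additive noise.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.1 * rng.standard_normal(sample_rate)
noisy = clean + noise

clean_mag = magnitude_spectrogram(clean)
noise_mag = magnitude_spectrogram(noise)
noisy_mag = magnitude_spectrogram(noisy)

mask = noise_dominance_mask(clean_mag, noise_mag)
features = two_channel_input(noisy_mag, mask)
```

In this toy case the mask is 0 around the tone's frequency bin, where the signal dominates, and mostly 1 in bins far from it, where only noise is present. In the setting described above, such a mask would be predicted by a trained network, since clean and noise signals are not separately available at test time.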
