Abstract

Automatic Speaker Verification (ASV) technology is increasingly used in end-user applications to secure access to personal data, smart services, and physical infrastructure. Like other biometric technologies, speaker verification is vulnerable to spoofing attacks, in which an attacker impersonates a specific target speaker using impersonation, replay, Text-to-Speech (TTS), or Voice Conversion (VC) techniques to gain unauthorized access to the system. This paper proposes a solution that combines Cochleagrams with Residual Networks (ResNets) to implement the front-end feature extraction phase of an Audio Spoof Detection (ASD) system. The proposed ASD system comprises three main phases: cochleagram generation, feature extraction with dimensionality reduction, and classification. In the first phase, the recorded audio is converted into Cochleagrams using Equivalent Rectangular Bandwidth (ERB)-based gammatone filters. In the next phase, three variants of Residual Networks, ResNet50, ResNet41, and ResNet27, are used in turn to extract dynamic features, yielding 2048, 1024, and 256 features per audio sample, respectively. The features extracted from ResNet50 and ResNet41 are passed to Linear Discriminant Analysis (LDA) for dimensionality reduction. Finally, in the classification phase, the LDA-reduced features are used to train four machine learning classifiers individually: Random Forest, Naïve Bayes, K-Nearest Neighbour (KNN), and eXtreme Gradient Boosting (XGBoost). The proposed work concentrates on synthetic, replay, and deepfake attacks. The state-of-the-art ASVspoof 2019 Logical Access (LA) and Physical Access (PA), Voice Spoofing Detection Corpus (VSDC), and DEepfake CROss-lingual (DECRO) datasets are utilised for training and testing the proposed ASD system. Additionally, we have assessed the performance of the proposed system under additive noise.
Airplane noise at different SNR levels (−5 dB, 0 dB, 5 dB, and 10 dB) was added to the training and testing audio for this purpose. From the obtained results, it can be concluded that the combination of Cochleagram and ResNet50 with the XGBoost classifier outperforms all other implemented systems for detecting fake audio in noisy environments. We also tested the proposed models in an unseen scenario, where they demonstrated reasonable performance.
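To make the cochleagram front end concrete, the sketch below shows a minimal ERB-spaced gammatone filterbank in NumPy. This is an illustrative reconstruction, not the authors' implementation: the filter order, bandwidth factor, channel count, and frame parameters are common textbook defaults (Glasberg and Moore's ERB formula; 4th-order gammatone with b = 1.019), and the actual system's settings may differ.

```python
import numpy as np

def erb(fc):
    # Equivalent Rectangular Bandwidth in Hz (Glasberg & Moore approximation)
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    # Impulse response of a gammatone filter centred at fc
    t = np.arange(int(duration * fs)) / fs
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(signal, fs, n_filters=64, fmin=50.0, frame_len=400, hop=160):
    # Centre frequencies spaced uniformly on the ERB-number scale
    erb_num = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    inv_erb_num = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    cfs = inv_erb_num(np.linspace(erb_num(fmin), erb_num(0.9 * fs / 2), n_filters))
    rows = []
    for fc in cfs:
        y = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
        # frame-wise log energy of each filter's output
        n_frames = 1 + (len(y) - frame_len) // hop
        frames = np.stack([y[i * hop: i * hop + frame_len] for i in range(n_frames)])
        rows.append(np.log(np.mean(frames ** 2, axis=1) + 1e-10))
    return np.stack(rows)  # shape: (n_filters, n_frames)

# Example: one second of a 1 kHz tone at 16 kHz sampling rate
fs = 16000
t = np.arange(fs) / fs
C = cochleagram(np.cos(2 * np.pi * 1000.0 * t), fs)
```

The resulting time-frequency matrix (here 64 channels by 98 frames) is what would then be rendered as an image and fed to the ResNet feature extractor.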