Abstract
Advances in deep learning and related approaches are making synthetic speech increasingly natural-sounding. They allow anyone to train a speech synthesizer on a target voice, producing a model that can imitate that voice with high accuracy. To address this threat, this work presents a fusion of spectrogram features and a Deep Convolutional Neural Network (DeepCNN) for detecting fake audio. The work is implemented in three phases: spectrogram generation, feature extraction and reduction, and classification. In the first phase, Erlang spectrograms are generated using the proposed Erlang filter bank. These spectrograms are then passed to a Residual Network (ResNet11) for feature extraction, and the extracted features are fed to Linear Discriminant Analysis (LDA) for dimensionality reduction. In the last phase, the LDA-reduced features are used to train four machine learning classifiers in turn: eXtreme Gradient Boosting (XGBoost), Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Random Forest (RF). The proposed approach has been evaluated on the complete Fake or Real (FoR) dataset for detecting fake audio under different scenarios. The combination of Erlang spectrogram, ResNet11, and LDA with XGBoost achieves equal error rates (EERs) as low as 0% and 1.4% on the FoR-original and FoR-2sec subsets, respectively.
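As a rough illustration of the three-phase pipeline the abstract describes, the sketch below wires together spectrogram generation, CNN feature extraction, LDA reduction, and XGBoost classification. It is not the paper's implementation: a mel spectrogram stands in for the proposed Erlang filter-bank spectrogram, torchvision's stock ResNet18 substitutes for the paper's ResNet11 (which is not a standard architecture), and the waveforms and labels are random placeholders purely so the code executes.

```python
# Illustrative sketch only; see the assumptions noted above.
import numpy as np
import librosa
import torch
import torchvision.models as models
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier


def spectrogram(waveform, sr):
    # Phase 1: spectrogram generation. The paper builds an Erlang filter
    # bank; a mel filter bank is substituted here as a placeholder.
    S = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
    return librosa.power_to_db(S, ref=np.max)


def extract_features(spec, backbone):
    # Phase 2: CNN feature extraction. The single-channel spectrogram is
    # tiled to 3 channels to match the ResNet input stem.
    x = torch.tensor(spec, dtype=torch.float32)[None, None].repeat(1, 3, 1, 1)
    with torch.no_grad():
        return backbone(x).squeeze().numpy()


# Truncate ResNet18 before its classifier head so it emits embeddings.
resnet = models.resnet18(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])
backbone.eval()

# Dummy data purely so the sketch runs: 8 one-second clips at 16 kHz,
# alternately labeled 0 = real, 1 = fake.
sr = 16000
X_train = [np.random.randn(sr).astype(np.float32) for _ in range(8)]
y_train = np.array([0, 1] * 4)

feats = np.stack([extract_features(spectrogram(w, sr), backbone)
                  for w in X_train])

# Phase 2 (cont.): LDA reduction; a binary task allows one component.
lda = LinearDiscriminantAnalysis(n_components=1)
feats_lda = lda.fit_transform(feats, y_train)

# Phase 3: classification. XGBoost gave the best reported results; NB,
# KNN, and RF would be trained on the same reduced features in turn.
clf = XGBClassifier().fit(feats_lda, y_train)
```

The pipeline's design choice is to keep the CNN as a fixed feature extractor and push the decision boundary into a lightweight classical classifier, with LDA in between so the classifiers operate on a low-dimensional, class-separated representation rather than the raw embedding.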