Abstract
Advances in deep learning and related approaches are making synthetic speech increasingly natural-sounding. They allow anyone to train a speech synthesizer on a target voice, producing a model that can imitate that voice with high accuracy. To address this threat, this work presents a fusion of spectrogram features and a Deep Convolutional Neural Network (DeepCNN) for detecting fake audio. The work is implemented in three phases: spectrogram generation, feature extraction and reduction, and classification. In the first phase, Erlang spectrograms are generated using the proposed Erlang filter bank. These spectrograms are then passed to a Residual Network (ResNet11) for feature extraction, and the extracted features are fed to Linear Discriminant Analysis (LDA) for dimensionality reduction. In the last phase, the LDA-reduced features are used to train four machine learning classifiers in turn: eXtreme Gradient Boosting (XGBoost), Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Random Forest (RF). The proposed approach has been evaluated on the complete Fake or Real (FoR) dataset for detecting fake audio under different scenarios. The combination of Erlang spectrogram, ResNet11, and LDA with XGBoost achieves equal error rates (EERs) as low as 0% and 1.4% on the FoR-original and FoR-2sec subsets, respectively.
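As a rough illustration of the three-phase pipeline the abstract describes, the sketch below wires together spectrogram generation, CNN feature extraction, LDA reduction, and XGBoost classification. It is not the paper's implementation: a mel spectrogram stands in for the proposed Erlang filter-bank spectrogram, torchvision's stock ResNet18 substitutes for the paper's ResNet11 (which is not a standard architecture), and the waveforms and labels are random placeholders purely so the code executes.

```python
# Illustrative sketch only; see the assumptions noted above.
import numpy as np
import librosa
import torch
import torchvision.models as models
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier


def spectrogram(waveform, sr):
    # Phase 1: spectrogram generation. The paper builds an Erlang filter
    # bank; a mel filter bank is substituted here as a placeholder.
    S = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
    return librosa.power_to_db(S, ref=np.max)


def extract_features(spec, backbone):
    # Phase 2: CNN feature extraction. The single-channel spectrogram is
    # tiled to 3 channels to match the ResNet input stem.
    x = torch.tensor(spec, dtype=torch.float32)[None, None].repeat(1, 3, 1, 1)
    with torch.no_grad():
        return backbone(x).squeeze().numpy()


# Truncate ResNet18 before its classifier head so it emits embeddings.
resnet = models.resnet18(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])
backbone.eval()

# Dummy data purely so the sketch runs: 8 one-second clips at 16 kHz,
# alternately labeled 0 = real, 1 = fake.
sr = 16000
X_train = [np.random.randn(sr).astype(np.float32) for _ in range(8)]
y_train = np.array([0, 1] * 4)

feats = np.stack([extract_features(spectrogram(w, sr), backbone)
                  for w in X_train])

# Phase 2 (cont.): LDA reduction; a binary task allows one component.
lda = LinearDiscriminantAnalysis(n_components=1)
feats_lda = lda.fit_transform(feats, y_train)

# Phase 3: classification. XGBoost gave the best reported results; NB,
# KNN, and RF would be trained on the same reduced features in turn.
clf = XGBClassifier().fit(feats_lda, y_train)
```

The pipeline's design choice is to keep the CNN as a fixed feature extractor and push the decision boundary into a lightweight classical classifier, with LDA in between so the classifiers operate on a low-dimensional, class-separated representation rather than the raw embedding.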