Deep4SNet: deep learning for fake speech classification

Dora M Ballesteros, Yohanna Rodriguez-Ortega, Diego Renza, Gonzalo Arce

doi:10.1016/j.eswa.2021.115465

Abstract

Fake speech consists of voice recordings created either by artificial intelligence or by signal processing techniques. Among the methods for generating fake voice recordings are Deep Voice and Imitation. In Deep Voice, the recordings sound slightly synthesized, whereas in Imitation, they sound natural. Detecting fake content is not trivial, given the large number of voice recordings transmitted over the Internet. To detect fake voice recordings obtained by Deep Voice and Imitation, we propose a solution based on a Convolutional Neural Network (CNN), using image augmentation and dropout. The proposed architecture was trained with 2092 histograms of both original and fake voice recordings and cross-validated with 864 histograms. A further 476 new histograms were used for external validation, on which Precision (P) and Recall (R) were calculated. Detection of fake audio reached P = 0.997, R = 0.997 for Imitation-based recordings, and P = 0.985, R = 0.944 for Deep Voice-based recordings. The global accuracy was 0.985. According to the results, the proposed system is successful in detecting fake voice content.
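
For readers less familiar with the reported metrics, the following is a minimal sketch of how Precision and Recall can be computed from true and predicted labels. It is an illustration only, not the authors' evaluation code: the function name `precision_recall`, the label encoding (1 = fake, 0 = original), and the toy labels are assumptions made for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`.

    Assumed encoding for this sketch: 1 = fake recording, 0 = original.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: fake recordings correctly flagged as fake.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: original recordings wrongly flagged as fake.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: fake recordings that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels, not data from the paper.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
p, r = precision_recall(y_true, y_pred)
```

Precision answers "of the recordings flagged as fake, how many really were fake", while Recall answers "of the fake recordings, how many were flagged". The lower Recall reported for Deep Voice (0.944) therefore indicates that some Deep Voice recordings were missed rather than that original recordings were wrongly flagged.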
