The essential steps of our study are to quantify and classify the differences between real and fake speech signals. To this end, we exploit the ability of deep learning to learn salient features. Within an ensemble classification pipeline, interpretable logical rules were combined with class activation maps for generalized reasoning, so that the different speech classes could be discriminated correctly. Fake audio samples were generated using a deep convolutional generative adversarial network (DCGAN). Our experiments were conducted on three datasets: Turkish, English, and bilingual. With the classification pipeline compiled into a majority-voting-based ensemble classifier, the accuracy for each individual language was approximately 90% in training and 80.33% in testing for the pipeline, and 73% for the majority-voting results when considered together with the appropriate test cases. To extract semantically rich rules, an interpretable logical rules infrastructure was used to infer fake speech from the class activations of the deep learning generative model. The study closes with a discussion and conclusions based on these findings.
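The majority-voting step mentioned above can be illustrated with a minimal sketch. It is not the implementation used in the study: the function names `majority_vote` and `ensemble_predict`, the label strings, and the tie-breaking rule are all assumptions made for illustration, and the abstract does not state how many classifiers were combined or how ties were resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for a single sample.

    Assumption: ties go to the label that appears first in `predictions`.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    # Walk the votes in order so that tie-breaking is deterministic.
    for label in predictions:
        if counts[label] == top:
            return label


def ensemble_predict(per_model_predictions):
    """Combine per-model predictions sample by sample.

    `per_model_predictions` holds one list of labels per model,
    all of the same length (one label per sample).
    """
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


# Hypothetical outputs of three classifiers on two speech samples.
model_outputs = [
    ["fake", "real"],
    ["fake", "fake"],
    ["real", "real"],
]
print(ensemble_predict(model_outputs))
```

Hard voting of this kind needs only the predicted labels from each classifier, so models with different architectures can be combined without calibrating their scores.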