Abstract

The developed system for verifying speaker identity in banking branches is based on voice biometrics employing the DeepSpeaker neural network model. Its resistance to a potential voice cloning attack was tested with recordings synthesized by neural networks, such as SV2TTS, Tacotron, WaveRNN, and GE2E. Subjective listening tests indicated that listeners can easily be confused and identify cloned recordings as original samples. Nearly 50% of respondents gave the wrong answer, mistaking the synthesized recording for the original one. In most cases, subjects declared that the quality of cloned and original recordings is similar (Good or Fair according to the ITU-R BS.1534 recommendation) and, in some cases, rated the cloned recording higher than the original (Good for the synthesized sample versus Fair for the original). Meanwhile, nearly all verification attempts with cloned samples failed (98.8% of samples were rejected). This demonstrates that voice biometrics based on deep neural networks can identify cloned samples better than human listeners. Methods and results of testing the resistance of the developed voice biometrics system to voice cloning attacks are discussed in the paper. This research was funded from the budget of project No. POIR.01.01.01-0092/19 subsidized by the Polish National Centre for Research and Development (NCBR).
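To illustrate the general principle behind such a verification system (not the authors' actual implementation), the sketch below shows an assumed embedding-based decision step: fixed-length speaker embeddings, such as those produced by a DeepSpeaker-style model, are compared with cosine similarity against an acceptance threshold. The embedding dimension, threshold value, and random vectors are hypothetical placeholders for illustration only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(enrolled: np.ndarray,
                   test: np.ndarray,
                   threshold: float = 0.75) -> bool:
    """Accept the claimed identity only if the embeddings are close enough.

    The threshold is an assumed value; in practice it would be tuned on a
    development set to balance false acceptances and false rejections.
    """
    return cosine_similarity(enrolled, test) >= threshold

# Toy usage with random vectors standing in for model-produced embeddings.
rng = np.random.default_rng(0)
enrolled = rng.normal(size=512)
genuine = enrolled + 0.1 * rng.normal(size=512)   # small perturbation: same speaker
spoof = rng.normal(size=512)                      # unrelated vector: spoof attempt
print(verify_speaker(enrolled, genuine))  # expected: True
print(verify_speaker(enrolled, spoof))    # expected: False
```

In a deployed system, the embeddings would come from the neural network rather than random vectors, and the rejection of cloned samples reported in the abstract corresponds to the spoof case falling below the acceptance threshold.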
