The rapid evolution of voice technology has heightened the need for robust detection systems that distinguish authentic from tampered speech. Recent competitions have significantly advanced the development of countermeasures against spoofing attacks. However, despite notable progress in detection technologies, existing methods often focus on a single type of tampering and a single language. Our contribution is an improved model that integrates an enhanced ResNet architecture with an LSTM to improve the detection of tampered audio, particularly in challenging multilingual scenarios. In the experiments, we built a hybrid dataset from self-recorded Chinese speech and public VCTK2 English samples, enhanced the generalization capabilities of the ResNet model, and evaluated our approach on this bilingual dataset. Experimental results demonstrate that the proposed approach achieves an equal error rate of 11.62% under bilingual conditions and, more importantly, outperforms the leading models from the ASVspoof 2021 and ADD 2022 competitions. We also employed advanced tampering techniques, including CycleGAN voice conversion and automatic splicing, to simulate real-world tampering scenarios and verify the effectiveness of the proposed approach.