Abstract

The rapid evolution of voice technology has heightened the need for robust detection systems that distinguish authentic speech from tampered speech. Recent competitions have significantly advanced the development of countermeasures against spoofing attacks. However, despite notable progress in detection technology, existing methods often focus on a single type of tampering and a single language. Our contribution is an improved model that integrates an enhanced ResNet architecture with an LSTM to improve the detection of tampered audio, particularly in challenging multilingual scenarios. In the experiments, we built a hybrid dataset from self-recorded Chinese speech and public VCTK2 English samples, enhanced the generalization capability of the ResNet model, and evaluated our approach on this bilingual dataset. Experimental results demonstrate that the proposed approach achieves an equal error rate of 11.62% under bilingual conditions and, more importantly, outperforms the leading models from the ASVspoof 2021 and ADD 2022 competitions. We also employed advanced tampering techniques, including CycleGAN voice conversion and automatic splicing, to simulate real-world tampering scenarios and verify the effectiveness of the proposed approach.
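
The equal error rate reported above is the operating point at which the false acceptance rate (tampered audio accepted as authentic) equals the false rejection rate (authentic audio rejected). The following is a minimal sketch of how such a figure can be computed, assuming a detector that assigns higher scores to authentic speech; the function name `equal_error_rate` and the toy scores are illustrative and are not taken from the paper.

```python
import numpy as np


def equal_error_rate(genuine_scores, tampered_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means "more likely authentic".
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    tampered = np.asarray(tampered_scores, dtype=float)

    # Candidate thresholds: every distinct score seen in either class.
    thresholds = np.unique(np.concatenate([genuine, tampered]))

    # False rejection rate: authentic samples scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: tampered samples scoring at or above it.
    far = np.array([(tampered >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their midpoint as the EER estimate.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    # Toy scores for illustration only.
    genuine = [0.9, 0.8, 0.3, 0.7]
    tampered = [0.1, 0.2, 0.75, 0.4]
    print(f"EER = {equal_error_rate(genuine, tampered):.2%}")
```

With a finite test set the two error curves rarely cross exactly, so this sketch reports the midpoint at the closest threshold; evaluation toolkits may interpolate instead, which can give slightly different values.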
