Abstract

Existing spoofing speech detection methods mostly use either convolutional neural networks or Transformer architectures as their backbone. These backbones fail to adequately represent speech features during feature extraction, resulting in poor detection and generalization performance. To address this limitation, we propose a novel spoofing speech detection method based on the Conformer architecture. The method integrates a convolutional module into the Transformer framework to strengthen its local feature modeling, enabling it to extract both local and global information from speech signals simultaneously. In addition, to mitigate the loss or degradation of semantic information that occurs in traditional feature pyramid networks during feature fusion, we propose a feature fusion method based on the asymptotic feature pyramid network (AFPN) that fuses multi-scale features and improves generalization to unknown attacks. Experiments on the ASVspoof 2019 LA dataset show that the proposed method achieves an equal error rate (EER) of 1.61% and a minimum tandem detection cost function (min t-DCF) of 0.045, improving detection performance while enhancing generalization against unknown spoofing attacks. In particular, it yields a substantial performance improvement in detecting the most challenging A17 attack.
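
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be estimated from detection scores. It is an illustration only, not the paper's evaluation code. The function name `compute_eer`, the convention that higher scores mean "more likely bona fide", and the toy scores are all assumptions made here; official ASVspoof scoring tools may handle thresholds and interpolation differently.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate (EER) from two lists of scores.

    Assumption: a higher score means the utterance looks more bona fide,
    and a trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where the two error rates
    are closest to each other.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
bonafide = [0.9, 0.8, 0.4, 0.6]
spoof = [0.1, 0.2, 0.5, 0.3]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

In this toy example one of four bona fide trials falls below the chosen threshold and one of four spoofed trials falls at or above it, so the two error rates meet at 25%. An EER of 1.61% means the corresponding operating point misclassifies about 1.61% of each class.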
