Abstract
Speech anti-spoofing is a crucial component of speaker recognition systems and has received a great deal of attention in recent years. Deep neural networks achieve satisfactory results on datasets whose training and testing data follow similar distributions, but they generalize poorly when the distributions differ. In this paper, we propose a novel dual-input hierarchical fusion network (HFN) to improve generalization. The network takes two inputs, the original speech signal and its time-reversed copy, which increases the volume and diversity of the training data. A hierarchical fusion model (HFM) fuses the two inputs after speech feature extraction, combining information from different levels more thoroughly and improving model performance. We evaluate the system on the ASVspoof 2021 Physical Access (PA) dataset, where it achieves an Equal Error Rate (EER) of 24.46% and a minimum tandem Detection Cost Function (min t-DCF) of 0.6708 on the test set. Compared with the four baseline systems of the ASVspoof 2021 competition, the proposed system reduces the min t-DCF by 28.9%, 31.0%, 32.6%, and 32.9%, and the EER by 35.7%, 38.1%, 45.4%, and 49.7%, respectively.
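The dual-input idea can be illustrated with a minimal sketch that builds the two inputs from one waveform. This is only an illustration under stated assumptions, not the authors' implementation: the function name `make_dual_inputs` is hypothetical, the waveform is assumed to be a NumPy array with time on the last axis, and the feature extraction and hierarchical fusion stages are not shown.

```python
import numpy as np


def make_dual_inputs(waveform: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (original, time_reversed) views of a waveform.

    Hypothetical helper: assumes time is the last axis of the array.
    """
    original = np.asarray(waveform, dtype=np.float32)
    # Reverse along the time axis; copy so the result owns contiguous memory.
    time_reversed = original[..., ::-1].copy()
    return original, time_reversed


# Toy four-sample "signal" standing in for a real utterance.
signal = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
original, time_reversed = make_dual_inputs(signal)
```

Each utterance thus yields a pair of equal-length signals, which a downstream model could process in separate branches before fusing them.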