Abstract
Synthetic speech is becoming increasingly prevalent, and automatic speaker verification (ASV) systems are vulnerable to the spoofing attacks it enables. However, most current synthetic speech detection methods consider only a single feature type. Since different features each capture, to some extent, the differences between genuine and synthetic speech, there must be common information shared across feature types. Effectively finding and fully exploiting this information facilitates the extraction of more discriminative features and leads to improved performance. Based on this analysis, we propose a deep correlation network (DCN) to learn the latent common information between different embeddings. It consists of two parts: a bi-parallel network and a correlation learning network. The bi-parallel network comprises different neural models that learn middle-level representations from front-end acoustic features. The correlation learning network, the core of the DCN, is proposed to explore the common information between these middle-level features. The common information obtained after DCN processing has better discriminative ability for synthetic speech detection. Experimental results show that the proposed DCN significantly improves the performance of synthetic speech detection systems on the ASVspoof 2019 and ASVspoof 2021 logical access sub-challenges.
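The abstract does not specify how the correlation learning network measures common information between the two middle-level embeddings. The following is a minimal NumPy sketch of the general idea only: two hypothetical parallel branches project different front-end features into a shared embedding space, and a per-dimension Pearson correlation serves as a stand-in objective for the "common information" the correlation learning network would be trained to maximise. All layer sizes, feature names, and the correlation measure itself are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def middle_level_embedding(features, weights):
    # Stand-in for one branch of the bi-parallel network: a single
    # linear layer with a tanh non-linearity producing a middle-level
    # embedding (real branches would be deep neural models).
    return np.tanh(features @ weights)

def correlation_score(za, zb):
    # Mean per-dimension Pearson correlation between the two embedding
    # batches -- a simple proxy for the shared ("common") information
    # a correlation learning network could be trained to maximise.
    za = za - za.mean(axis=0)
    zb = zb - zb.mean(axis=0)
    num = (za * zb).sum(axis=0)
    den = np.sqrt((za ** 2).sum(axis=0) * (zb ** 2).sum(axis=0)) + 1e-8
    return float(np.mean(num / den))

# Toy batch: 32 utterances with two different front-end acoustic
# feature types (dimensions are hypothetical).
feat_a = rng.standard_normal((32, 40))
feat_b = rng.standard_normal((32, 60))

# Random, untrained branch weights projecting into a shared 16-dim space.
wa = rng.standard_normal((40, 16))
wb = rng.standard_normal((60, 16))

score = correlation_score(middle_level_embedding(feat_a, wa),
                          middle_level_embedding(feat_b, wb))
print(round(score, 4))
```

In a trained system, gradients of an objective like `correlation_score` would be back-propagated into both branches so that the embeddings converge on the discriminative information common to both feature types.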