Abstract

Synthetic speech is becoming increasingly prevalent, and automatic speaker verification (ASV) systems are vulnerable to the spoofing attacks it enables. However, most current synthetic speech detection methods rely on a single type of feature. Since different features each capture, to some extent, the differences between genuine and synthetic speech, common information must exist across feature types. Effectively identifying and fully exploiting this information facilitates the extraction of more discriminative features and yields improved performance. Based on this analysis, we propose a deep correlation network (DCN) to learn the latent common information between different embeddings. It consists of two parts: a bi-parallel network and a correlation learning network. The bi-parallel network comprises two different neural models that learn middle-level representations from front-end acoustic features. The correlation learning network, the core component of the DCN, explores the common information between these middle-level features. The common information obtained by the DCN has stronger discriminative ability for synthetic speech detection. Experimental results show that the proposed DCN significantly improves the performance of synthetic speech detection systems on the ASVspoof 2019 and ASVspoof 2021 logical access (LA) sub-challenges.
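One way to make the correlation-learning idea concrete is to treat it as a loss that rewards agreement between the two branch embeddings. The sketch below is an illustrative assumption, not the authors' actual DCN objective: it averages the per-dimension Pearson correlation between paired middle-level embeddings (e.g. from two parallel branches fed by different acoustic features) and negates it, so that minimizing the loss maximizes the shared information the branches express.

```python
import math
import random

def pearson(x, y, eps=1e-8):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y)) + eps
    return num / den

def correlation_loss(h1, h2):
    """Hypothetical correlation objective: h1, h2 are lists of
    per-dimension activation sequences from the two branches.
    Returns the negated mean per-dimension correlation, so lower
    loss means the branches agree more (shared information)."""
    dims = len(h1)
    return -sum(pearson(h1[d], h2[d]) for d in range(dims)) / dims

# Toy demo: two noisy "views" of the same underlying embedding, standing
# in for the two branch outputs; names and shapes are illustrative.
random.seed(0)
base = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
h1 = [[v + 0.1 * random.gauss(0, 1) for v in row] for row in base]
h2 = [[v + 0.1 * random.gauss(0, 1) for v in row] for row in base]
loss = correlation_loss(h1, h2)
print(f"loss = {loss:.3f}")  # strongly negative: the views share information
```

In a trained system this term would typically be combined with the usual bona fide / spoof classification loss, so the branches are pushed to agree on exactly the information that discriminates real from synthetic speech.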
