Abstract

In recent years, speaker verification systems have been deployed in many production settings. Unfortunately, they remain highly vulnerable to various spoofing attacks, such as speech synthesis attacks and replay attacks. Researchers have proposed many defenses, but existing methods focus only on speech features. Recent studies have shown that speech carries a large amount of facial information: a speaker's gender, age, mouth shape, and other attributes can be inferred from the voice. This information can help distinguish spoofing attacks from genuine speech. Inspired by this observation, we propose a generalized framework named GACMNet. To cope with different attack scenarios, we instantiate two different models. Our framework comprises four phases: data pre-processing, feature extraction, feature fusion, and classification. Specifically, the framework consists of two branches: one extracts face features from speech with a convolutional neural network, and the other extracts speech features with a densely connected network. Furthermore, we design a global attention-based information fusion mechanism to weigh the importance of each part of the features. Our solution proves effective in two broad scenarios. Compared with existing methods, our model improves the tandem decision cost function (t-DCF) and equal error rate (EER) scores by 9% and 11%, respectively, in the logical access scenario, and improves the EER score by 10% in the physical access scenario.
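The two-branch design with attention-weighted fusion can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function and parameter names (`global_attention_fusion`, `w_face`, `w_speech`) are hypothetical, and the real GACMNet computes branch features with a CNN and a densely connected network rather than random vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention_fusion(face_feat, speech_feat, w_face, w_speech):
    """Score each branch with a (hypothetical) learned projection,
    normalize the scores into attention weights, and concatenate
    the attention-scaled branch features."""
    scores = np.array([face_feat @ w_face, speech_feat @ w_speech])
    alpha = softmax(scores)  # importance of each branch, sums to 1
    fused = np.concatenate([alpha[0] * face_feat, alpha[1] * speech_feat])
    return fused, alpha

# Stand-in branch outputs; in the paper these would come from the
# face-feature CNN and the densely connected speech-feature network.
rng = np.random.default_rng(0)
face_feat = rng.standard_normal(4)
speech_feat = rng.standard_normal(4)
fused, alpha = global_attention_fusion(face_feat, speech_feat,
                                       rng.standard_normal(4),
                                       rng.standard_normal(4))
```

The fused vector would then feed the classification phase, which decides whether the utterance is bona fide or spoofed.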
