In recent years, logic-manipulated speech detection has relied predominantly on single-feature, single-channel networks, which fail to exploit a broader range of acoustic features. To address this limitation, we propose a novel approach for detecting logic-manipulated speech based on a dual-branch network with fused Mel features. Our network consists of three main components. First, we extract both Mel spectrograms and Mel-frequency cepstral coefficients (MFCCs) to compensate for the information lost when the feature dimension is reduced. Second, to better exploit these two feature types, we construct a dual-branch network framework that learns temporal and frequency-domain characteristics separately. Finally, at the backend of the network, we fuse the resulting feature vectors for decision-making. Experimental evaluations on the ASVspoof2019 LA dataset demonstrate the effectiveness and superiority of the proposed method over state-of-the-art models and alternative acoustic features.
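The dual-branch idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feature dimensions (80 Mel bins, 13 MFCCs), the random linear projections standing in for the learned branches, and the concatenation-based fusion are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature shapes (not from the paper): an 80-bin Mel
# spectrogram and 13-dim MFCCs over 200 frames of one utterance.
mel = rng.standard_normal((200, 80))
mfcc = rng.standard_normal((200, 13))


def branch(features: np.ndarray, out_dim: int) -> np.ndarray:
    """Stand-in for one learned branch: a linear projection with a
    nonlinearity, mean-pooled over time into a fixed-length embedding."""
    w = rng.standard_normal((features.shape[1], out_dim))
    return np.tanh(features @ w).mean(axis=0)


# Each branch processes one feature type independently...
mel_vec = branch(mel, 64)
mfcc_vec = branch(mfcc, 64)

# ...and the backend fuses the two embeddings (concatenation here)
# before a binary bona fide / spoofed decision.
fused = np.concatenate([mel_vec, mfcc_vec])        # shape (128,)
w_out = rng.standard_normal(fused.shape[0])
score = 1.0 / (1.0 + np.exp(-(fused @ w_out)))     # sigmoid score in (0, 1)
```

In a trained system the projections would be replaced by the two branch sub-networks and the final linear layer by a learned classifier; only the overall data flow (separate branches, backend fusion, single decision score) mirrors the description above.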