Synthetic speech detection using temporal modulation feature

Zhizheng Wu,Haizhou Li,Xiong Xiao,Eng Siong Chng

doi:10.1109/icassp.2013.6639067

Publication Date: May 1, 2013

Citations: 117

Affiliation: Nanyang Technological University

Abstract

Voice conversion and speaker adaptation techniques pose a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, anti-spoofing techniques that distinguish synthetic from human speech are necessary. In this study, we continue the effort to discriminate between synthetic and human speech. Motivated by the fact that current analysis-synthesis techniques operate at the frame level and assume frame-by-frame independence, we propose magnitude and phase modulation features for distinguishing synthetic speech from human speech. Modulation features derived from the magnitude or phase spectrum carry long-term temporal information about speech, and may be able to detect temporal artifacts caused by frame-by-frame processing during speech synthesis. Our synthetic speech detection results show that the modulation features provide information complementary to the magnitude and phase features. The best detection performance is obtained by fusing phase modulation features with phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features alone and the 10.98% of MFCC features.
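As a rough illustration of what a magnitude modulation feature captures, the sketch below computes a short-time magnitude spectrogram and then takes a second Fourier transform along the time axis of each frequency bin, so that each bin's trajectory across frames is described by its modulation spectrum. This is not the paper's implementation: the abstract does not give the configuration, so the frame length, hop size, FFT sizes, segment length, log compression, and the function names `magnitude_spectrogram` and `modulation_spectrum` are all assumptions chosen for the example, and any further processing used by the authors is not reproduced here.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Return |STFT| with shape (n_frames, n_fft // 2 + 1).

    All parameter values are illustrative assumptions (25 ms frames and
    a 10 ms hop at a 16 kHz sampling rate), not taken from the paper.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


def modulation_spectrum(spec, seg_len=50, mod_fft=64):
    """Modulation spectrum of the first `seg_len` frames of `spec`.

    Each frequency bin's log-magnitude trajectory over time is mean
    normalised and transformed with an FFT along the time axis.
    Returns an array of shape (mod_fft // 2 + 1, n_bins).
    """
    segment = np.log(spec[:seg_len] + 1e-10)            # log compression (assumed)
    segment = segment - segment.mean(axis=0, keepdims=True)  # remove per-bin mean
    return np.abs(np.fft.rfft(segment, n=mod_fft, axis=0))


# Usage on one second of synthetic noise standing in for 16 kHz speech.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
spec = magnitude_spectrogram(signal)
mod = modulation_spectrum(spec)
```

Because the second transform spans many consecutive frames, it describes how each bin evolves over a longer window than a single frame, which is the kind of long-term temporal information the abstract refers to. A phase-based variant would start from a phase-derived spectrogram instead of the magnitude one.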

Full Text