Abstract

There are many studies on distinguishing human speech from artificially generated speech, and on automatic speaker verification (ASV), which aims to determine whether given speech belongs to a claimed speaker. Recent studies demonstrate the success of the relative phase (RP) feature in speaker recognition/verification and in the detection of synthesized and converted speech. However, few studies have focused on the RP feature for replay attack detection. In this paper, we improve the discriminating ability of the RP feature by proposing two new auditory filter-based RP features for replay attack detection. The key idea is to integrate the advantage of RP-based features in signal representation with the advantage of auditory filter banks. For the first proposed feature, we apply a Mel-filter bank to convert the signal representation of conventional RP information from a linear scale to a Mel scale; the modified representation is called the Mel-scale RP feature. For the other proposed feature, a gammatone filter bank is applied to scale the RP information; the scaled RP feature is called the gammatone-scale RP feature. These two proposed phase-based features are expected to achieve better performance than the conventional RP feature because of their improved scale resolution. In addition to the use of individual Mel/gammatone-scale RP features, a combination of the scores of these proposed RP features with a standard magnitude-based feature, the constant Q transform cepstral coefficient (CQCC), is also applied to further improve the reliability of the detection decision. The effectiveness of the proposed Mel-scale RP feature, gammatone-scale RP feature, and their combination is evaluated using the ASVspoof 2017 dataset. On the evaluation dataset, our proposed methods demonstrate significant improvement over the existing RP feature and the baseline CQCC feature.
The combination of the CQCC and gammatone-scale RP provides the best performance compared with an individual baseline feature and other combination methods.
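The abstract describes normalizing the phase spectrum relative to a base frequency and then projecting it onto a Mel scale. The sketch below is an illustrative reconstruction of that pipeline, not the paper's exact formulation: the base-bin normalization, the cos/sin mapping, and all parameter values (`n_fft`, `base_bin`, 20 Mel bands) are assumptions chosen for clarity. The gammatone-scale variant would follow the same pattern with a gammatone filter bank in place of the Mel one.

```python
import numpy as np

def relative_phase(frame, n_fft=512, base_bin=1):
    """Phase spectrum normalized relative to a base frequency bin.

    Relative phase rotates each bin's phase so that the base bin's phase
    is zero, making the representation consistent across frames (a common
    RP formulation; the paper's exact normalization may differ).
    """
    spec = np.fft.rfft(frame, n_fft)
    phase = np.angle(spec)
    bins = np.arange(len(phase))
    # phi'(k) = phi(k) - (k / k_base) * phi(k_base)
    rp = phase - (bins / base_bin) * phase[base_bin]
    # map to cos/sin to avoid phase-wrapping discontinuities
    return np.cos(rp), np.sin(rp)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filter bank (standard HTK-style construction)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    freqs = mel_to_hz(mels)
    bin_idx = np.floor((n_fft + 1) * freqs / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bin_idx[i - 1], bin_idx[i], bin_idx[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

# Mel-scale RP: project cos/sin RP components through the Mel filter bank
sr, n_fft = 16000, 512
frame = np.hanning(400) * np.random.randn(400)  # toy windowed frame
cos_rp, sin_rp = relative_phase(frame, n_fft)
fb = mel_filterbank(20, n_fft, sr)
mel_rp = np.concatenate([fb @ cos_rp, fb @ sin_rp])  # 40-dim feature vector
```

In practice the frame-level vectors would be stacked over an utterance and fed to a classifier (e.g. a GMM back-end), but those details are outside this sketch.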

Highlights

  • Automatic speaker verification (ASV) aims to determine whether the given speech belongs to a given speaker [1]

  • We experimented on three baseline features: Mel frequency cepstral coefficients (MFCC), constant Q cepstral coefficients (CQCC), and modified group delay cepstral coefficients (MGDCC)

  • The CQCC performed better than the MFCC, as the CQCC has variable spectral resolution, whereas the MFCC mainly emphasizes low-frequency components


Introduction

Automatic speaker verification (ASV) aims to determine whether the given speech belongs to a given speaker [1]. Because an anti-spoofing task mainly focuses on the characteristics of the given speech, in this paper we focus on a new feature extraction method that provides better discrimination between replayed speech and genuine speech. In addition to the aforementioned single features, fusion systems [19, 20] have been proposed that combine scores from different feature/classifier-based replay attack detectors. A fused system of linear frequency cepstral coefficients (LFCC) and rectangular filter cepstral coefficients (RFCC) was proposed in [20]. In all these works, the best performing system depends strongly on an individual amplitude-based feature and on score combination, which fuses scores from different features/classifiers derived from amplitude information. Previous work [21, 22] showed that magnitude features alone are sometimes insufficient for the discrimination task. This is because phase information, which constitutes half of the original speech signal, is ignored when discriminating between replayed and genuine speech.
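The score combination mentioned above can be as simple as a weighted sum of per-utterance detection scores from two subsystems. The sketch below illustrates that idea; the weight value and the toy scores are assumptions for illustration (in practice the weight would be tuned on a development set, e.g. by logistic regression).

```python
import numpy as np

def fuse_scores(score_cqcc, score_rp, w=0.5):
    """Weighted-sum score-level fusion of two detectors.

    `w` balances the CQCC-based and RP-based subsystems; it is a free
    parameter tuned on held-out data, not a value from the paper.
    """
    return w * score_cqcc + (1.0 - w) * score_rp

# toy per-utterance detection scores (higher = more likely genuine)
s_cqcc = np.array([1.2, -0.4, 0.8])
s_rp = np.array([0.9, -0.7, 1.1])
fused = fuse_scores(s_cqcc, s_rp, w=0.6)
```

The fused score is then thresholded to make the genuine/replay decision, with the threshold typically set at the development-set equal error rate (EER) operating point.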

