Abstract
As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is human speech, then the ASV can determine whether this speech matches the speaker's identity. The performance of such a tandem system can be measured with a tandem detection cost function (t-DCF). However, ASV and CM systems are usually trained separately, using different metrics and data, which does not optimize their combined performance. In this work, we propose to optimize the tandem system directly by creating a differentiable version of t-DCF and employing techniques from reinforcement learning. The results indicate that these approaches offer better outcomes than fine-tuning, with our method providing a 20% relative improvement in t-DCF on the ASVspoof 2019 dataset in a constrained setting.
Highlights
An automatic speaker verification (ASV) system attempts to verify whether a given speech utterance matches the claimed identity [1].
Spoofing countermeasure (CM) systems aim to detect audio samples crafted to spoof an ASV system, and improve security when combined with an ASV system [3].
Given its success with other metrics, we extend the idea of a soft detection cost function (DCF) to the tandem detection cost function (t-DCF) and assess its applicability to tandem optimization.
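To make the idea of a "soft" cost concrete, the sketch below relaxes the standard single-system detection cost, C_miss * P_target * P_miss + C_fa * (1 - P_target) * P_fa, by replacing the hard threshold decision with a sigmoid. This is only an illustration of the relaxation and not the paper's t-DCF formulation: the function names, the cost and prior values, and the sharpness parameter are all assumptions chosen for the example.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hard_dcf(target_scores, nontarget_scores, threshold,
             c_miss=1.0, c_fa=1.0, p_target=0.05):
    """Detection cost with hard decisions (costs and prior are assumed values)."""
    # Miss: a target trial scored at or below the threshold.
    p_miss = sum(s <= threshold for s in target_scores) / len(target_scores)
    # False alarm: a nontarget trial scored above the threshold.
    p_fa = sum(s > threshold for s in nontarget_scores) / len(nontarget_scores)
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


def soft_dcf(target_scores, nontarget_scores, threshold,
             c_miss=1.0, c_fa=1.0, p_target=0.05, sharpness=10.0):
    """Sigmoid-relaxed detection cost; smooth in the scores and the threshold."""
    # Each trial contributes a value in (0, 1) instead of a 0/1 error count.
    p_miss = sum(sigmoid(sharpness * (threshold - s))
                 for s in target_scores) / len(target_scores)
    p_fa = sum(sigmoid(sharpness * (s - threshold))
               for s in nontarget_scores) / len(nontarget_scores)
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
```

As the sharpness grows, the soft cost approaches the hard one, while smaller values give smoother gradients at the price of a looser approximation.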
Summary
An automatic speaker verification (ASV) system attempts to verify whether a given speech utterance matches the claimed identity [1]. Spoofing countermeasure (CM) systems aim to detect audio samples crafted to spoof an ASV system, and improve security when combined with an ASV system [3]. This improvement is achieved by training the two systems separately, using them in conjunction with each other, and evaluating their performance using a tandem detection cost function (t-DCF) [4]. Although they are evaluated using this tandem metric, the original ASV and CM systems are not trained to minimize the t-DCF. In addition, some attack systems used to generate spoof samples can fool the CM yet still be detected by the ASV system, as is the case with system A17 in the ASVspoof dataset [5].
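A tandem decision of the kind described here can be sketched as a simple cascade in which a trial is accepted only when both systems agree. The function name, the thresholds, and the score convention (higher means more bona fide for the CM and more target-like for the ASV) are assumptions for illustration, and the scores in the usage lines are made up.

```python
def tandem_accept(cm_score, asv_score, cm_threshold=0.0, asv_threshold=0.0):
    """Accept a trial only if the CM and the ASV both accept it.

    Assumed convention: higher CM score means more likely bona fide speech,
    higher ASV score means more likely the claimed speaker.
    """
    is_bonafide = cm_score > cm_threshold
    is_target = asv_score > asv_threshold
    return is_bonafide and is_target


# Hypothetical spoofed trial that fools the CM but is rejected by the ASV.
spoof_fools_cm = tandem_accept(cm_score=1.2, asv_score=-0.8)

# Hypothetical bona fide target trial accepted by both systems.
genuine_target = tandem_accept(cm_score=1.2, asv_score=0.9)
```

Because the final decision depends on both scores jointly, an error by one system can be absorbed by the other, which is why a cost defined on the cascade can differ from the two systems' individual costs.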
More From: IEEE/ACM Transactions on Audio, Speech, and Language Processing