Abstract

In this paper, we present new research results on the vulnerability of speaker verification (SV) systems to synthetic speech. Using a state-of-the-art i-vector SV system evaluated on the Wall Street Journal (WSJ) corpus, our SV system achieves a 0.00% false rejection rate (FRR) and a 1.74 × 10⁻⁵ false acceptance rate (FAR). When the i-vector system is tested with state-of-the-art speaker-adaptive, hidden Markov model (HMM)-based synthetic speech generated from speaker models derived from the WSJ corpus, 22.9% of the matched claims are accepted, highlighting the vulnerability of SV systems to synthetic speech. We propose a new synthetic speech detector (SSD) that uses previously proposed features derived from image analysis of pitch patterns, but extracted on phoneme-level segments, and that leverages the enrollment speech already available to the SV system. When the SSD is applied to the human and synthetic speech accepted by the SV system, the overall system has an FRR of 7.35% and a FAR of 2.34 × 10⁻⁴, which is lower than previously reported systems and thus significantly reduces the vulnerability.
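As a rough illustration of the cascaded decision rule described above (the SSD is applied only to trials the SV system has already accepted) and of how the quoted FAR/FRR figures are computed from trial outcomes, the following Python sketch shows one possible formulation. It is not the authors' implementation; the function names, arguments, and thresholds (cascade_accept, far_frr, sv_score, sv_threshold, ssd_is_synthetic) are hypothetical placeholders for illustration only.

    # Minimal sketch of an SV + SSD cascade and FAR/FRR computation.
    # All names are illustrative assumptions, not from the paper.

    def cascade_accept(sv_score, sv_threshold, ssd_is_synthetic):
        """Accept a claim only if the SV system accepts it and the SSD does not flag it."""
        if sv_score < sv_threshold:
            return False              # rejected outright by the SV system
        return not ssd_is_synthetic   # SSD screens only SV-accepted trials

    def far_frr(decisions, is_genuine):
        """Compute false acceptance / false rejection rates from trial outcomes.

        decisions:  list of booleans (True = claim accepted)
        is_genuine: list of booleans (True = genuine target trial)
        """
        impostor = [d for d, g in zip(decisions, is_genuine) if not g]
        genuine = [d for d, g in zip(decisions, is_genuine) if g]
        far = sum(impostor) / len(impostor) if impostor else 0.0
        frr = sum(1 for d in genuine if not d) / len(genuine) if genuine else 0.0
        return far, frr

In this formulation, the SSD can only lower the FAR of the combined system (it never accepts a trial the SV system rejected), at the cost of a possible increase in FRR when genuine speech is mistakenly flagged as synthetic, which is consistent with the trade-off reported in the abstract.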
