Abstract

The traditional acoustic-phonetic approach uses both spectral and phonetic information when comparing the voices of speakers. Because phonetic units are not equally informative, the phonetic context of speech plays an important role in speaker verification (SV). In this paper, we propose a neural acoustic-phonetic approach that learns to dynamically assign differentiated weights to spectral features for SV. These differentiated weights form a phonetic attention mask (PAM). The neural acoustic-phonetic framework consists of two training pipelines, one for SV and another for speech recognition. Through the PAM, we leverage phonetic information for SV. We evaluate the proposed framework on Part III of the RSR2015 database, which consists of random digit strings. The proposed framework with PAM consistently outperforms the baseline, with equal error rate reductions of 13.45% and 10.20% for female and male data, respectively.

Highlights

  • Speaker verification (SV) seeks to verify the claimed identity of a speaker [1]

  • The 5-digit SV test trials can be grouped into four categories by pairing speaker identity and speech content, namely, Target Correct (TC), Impostor Correct (IC), Target Wrong (TW), and Impostor Wrong (IW), as described in [12]

  • We observe that the acoustic-only baseline achieves results comparable to those of other competing systems, and that the acoustic-phonetic system with phonetic attention mask (PAM) reduces the equal error rate (EER) by 13.45% and 10.20% for female and male data, respectively
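
The four trial categories named in the highlights follow from two yes/no questions: does the speaker match the claimed identity, and does the spoken digit string match the expected one? The following is a minimal sketch of that pairing in Python. The `Trial` fields and the `categorize` function are hypothetical names introduced for illustration only; they are not taken from the RSR2015 protocol files.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    # All field names are illustrative assumptions, not RSR2015 identifiers.
    claimed_speaker: str   # identity the test utterance claims to be
    actual_speaker: str    # identity of the person who actually spoke
    expected_digits: str   # digit string the system expects
    spoken_digits: str     # digit string that was actually spoken


def categorize(trial: Trial) -> str:
    """Map a trial to TC, TW, IC, or IW from speaker and content agreement."""
    is_target = trial.claimed_speaker == trial.actual_speaker
    is_correct = trial.expected_digits == trial.spoken_digits
    if is_target:
        return "TC" if is_correct else "TW"
    return "IC" if is_correct else "IW"


# Example: an impostor who speaks the expected digit string.
example = Trial("spk01", "spk02", "38152", "38152")
label = categorize(example)
```

Under this sketch, only a TC trial has both the speaker and the content in agreement; the other three categories each disagree on at least one of the two.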

Summary

INTRODUCTION

Speaker verification (SV) seeks to verify the claimed identity of a speaker [1]. It can be broadly categorized into text-dependent and text-independent tasks [2], [3]. We study a neural acoustic-phonetic approach for SV of random digit strings in Part III of the RSR2015 database [12]. In [25], speaker-utterance dual attention (SUDA) is applied to learn the interaction between speaker and utterance information streams in a unified framework for text-dependent SV. Along this line of thought, we believe that the speaker traits carried by different phonetic contexts are not equally informative. We propose a neural acoustic-phonetic approach with a phonetic attention mask (PAM) that learns to dynamically assign differentiated weights to the speaker feature map, together with a network architecture that is optimized for phonetically informed SV.
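
To make the idea of a phonetic attention mask concrete, the sketch below weights a speaker feature map element-wise with a mask derived from frame-level phonetic features. This summary does not specify how the PAM is computed, so everything here is an assumption for illustration: the sigmoid projection, the shapes, and the names `apply_phonetic_attention_mask`, `weight`, and `bias` are hypothetical and should not be read as the authors' architecture.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def apply_phonetic_attention_mask(speaker_map, phonetic_feats, weight, bias):
    """Weight a speaker feature map with a phonetically derived mask.

    speaker_map:    (T, C) frame-level speaker features
    phonetic_feats: (T, D) frame-level phonetic features
    weight, bias:   (D, C) and (C,) projection parameters (assumed, and
                    random here rather than learned)
    """
    # Assumed mask form: a sigmoid projection, so each weight lies in (0, 1).
    mask = sigmoid(phonetic_feats @ weight + bias)   # (T, C)
    # Differentiated weights are applied element-wise to the feature map.
    return speaker_map * mask, mask


rng = np.random.default_rng(0)
T, C, D = 50, 8, 16                                  # illustrative sizes
speaker_map = rng.standard_normal((T, C))
phonetic_feats = rng.standard_normal((T, D))
weight = 0.1 * rng.standard_normal((D, C))
bias = np.zeros(C)

masked_map, mask = apply_phonetic_attention_mask(
    speaker_map, phonetic_feats, weight, bias
)
```

In a trained system the projection parameters would be learned jointly with the rest of the network, so that frames with more speaker-discriminative phonetic content receive larger weights.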

NEURAL ACOUSTIC-PHONETIC APPROACH
Training Strategy
Database
Experimental Setup
System Training
System Description
Acoustic-Phonetic Scoring
Speaker Verification
Multi-task Experiments
Digit Recognition
CONCLUSION
