Abstract

Automatic speaker verification provides a flexible and effective way to perform biometric authentication. Previous deep learning-based methods have demonstrated promising results, but a few problems still require better solutions. In prior work on speaker-discriminative neural networks, the representation of the target speaker is treated as fixed when compared against utterances from different speakers, and the joint information between enrollment and evaluation utterances is ignored. In this paper, we propose to combine CNN-based feature learning with a bidirectional attention mechanism to achieve better performance with only one enrollment utterance. The evaluation-enrollment joint information is exploited to provide interactive features through bidirectional attention. In addition, we introduce an auxiliary cost function that identifies the phonetic contents, which helps compute the attention scores more precisely. These interactive features are complementary to the constant ones, which are extracted from individual speakers separately and do not vary with the evaluation utterances. The proposed method achieved a competitive equal error rate of 6.26% on the internal “DAN DAN NI HAO” benchmark dataset with 1250 utterances and outperformed various baseline methods, including the traditional i-vector/PLDA, d-vector, self-attention, and sequence-to-sequence attention models.
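
To make the evaluation-enrollment interaction concrete, the following is a minimal sketch of frame-level bidirectional attention between two utterances. It is not the authors' released implementation; the feature dimensions, average pooling, and cosine scoring are illustrative assumptions.

import torch
import torch.nn.functional as F

def bidirectional_attention(enroll, evaluate):
    """enroll:   (T_e, D) frame-level CNN features of the enrollment utterance
       evaluate: (T_v, D) frame-level CNN features of the evaluation utterance
       Returns one interactive embedding per utterance."""
    # Cross-similarity between every enrollment and evaluation frame.
    scores = enroll @ evaluate.T                  # (T_e, T_v)

    # Enrollment frames attend to evaluation frames, and vice versa,
    # so each representation depends on the paired utterance.
    attn_e = F.softmax(scores, dim=1) @ evaluate  # (T_e, D)
    attn_v = F.softmax(scores.T, dim=1) @ enroll  # (T_v, D)

    # Average-pool the attended frames into utterance-level vectors.
    return attn_e.mean(dim=0), attn_v.mean(dim=0)

# Usage: the cosine similarity between the two interactive embeddings
# can then be thresholded to accept or reject the claimed identity.
e = torch.randn(120, 64)  # hypothetical enrollment features
v = torch.randn(150, 64)  # hypothetical evaluation features
emb_e, emb_v = bidirectional_attention(e, v)
score = F.cosine_similarity(emb_e, emb_v, dim=0)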

Highlights

  • Automatic speaker verification (SV) aims to verify the identity of a person based on his/her voice

  • To address the above-mentioned concerns, we propose a novel framework based on bidirectional attention and a convolutional neural network (BaCNN) to generate dynamic speaker representations for both the enrollment and evaluation utterances and to verify the speaker’s identity effectively

  • Detection error trade-off (DET) curves demonstrate the error rates at different operating points; a sketch of the related equal error rate computation follows this list
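
As a worked illustration of the metric behind those curves, the sketch below computes the equal error rate (EER), i.e., the DET operating point where the false-acceptance and false-rejection rates coincide. The score distributions are hypothetical and the routine is independent of the paper's evaluation code.

import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Find the threshold where the false-acceptance rate (FAR) equals
    the false-rejection rate (FRR); their common value is the EER."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        frr = np.mean(genuine_scores < t)    # genuine trials rejected
        far = np.mean(impostor_scores >= t)  # impostor trials accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

genuine = np.random.normal(0.7, 0.1, 1000)   # hypothetical target scores
impostor = np.random.normal(0.3, 0.1, 1000)  # hypothetical non-target scores
print(f"EER: {compute_eer(genuine, impostor):.2%}")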

Introduction

Automatic speaker verification (SV) aims to verify the identity of a person based on his/her voice. It can be categorized into text-dependent and text-independent types, according to whether the lexical content of the enrollment utterance is the same as that of the evaluation utterance [1,2,3,4]. Text-dependent SV (TDSV) outperforms the text-independent type owing to its limited phonetic variability and its robustness to short utterances [5,6]. With the development of smartphones and mobile applications, interacting with mobile devices through short speech commands is becoming more and more popular, and voice authentication through a given speech password has been widely accepted [7].
