Abstract

Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, SV systems face an increasing requirement: they should be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more requirement that is important for practical applications: the system should be robust to an audio stream containing long non-speech segments, where voice activity detection (VAD) is not applied. To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present the FPM-based MSA to deal with short speech segments in noisy and reverberant environments, and we use the SAS-VAD to increase robustness to long non-speech segments. To further improve robustness to acoustic distortions (i.e., noise and reverberation), we apply a masking-based speech enhancement (SE) method. We combine the SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner. To the best of our knowledge, this is the first work to combine these three models in a single deep learning framework. We conduct experiments on the Korean indoor (KID) and VoxCeleb datasets, both corrupted by noise and reverberation. The results show that the proposed method is effective for SV under these challenging conditions and outperforms the baseline i-vector and deep speaker embedding systems.
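
The abstract describes a single network that chains masking-based SE, soft VAD, and speaker embedding extraction, trained end-to-end. The PyTorch sketch below is only a minimal illustration of how such a pipeline can be composed; all layer sizes, module names, and the posterior-weighted pooling are illustrative assumptions, not the authors' architecture.

```python
# A minimal sketch (not the authors' code) of a unified SE -> VAD -> SV
# pipeline like the one described in the abstract. All module sizes and the
# weighted-pooling choice are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedSV(nn.Module):
    def __init__(self, n_feat=64, emb_dim=128):
        super().__init__()
        # Masking-based SE: predict a [0, 1] time-frequency mask per frame.
        self.se = nn.Sequential(nn.Linear(n_feat, 128), nn.ReLU(),
                                nn.Linear(128, n_feat), nn.Sigmoid())
        # Soft VAD: predict a per-frame speech posterior in [0, 1].
        self.vad = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU(),
                                 nn.Linear(32, 1), nn.Sigmoid())
        # Frame-level speaker encoder followed by an embedding layer.
        self.enc = nn.Sequential(nn.Linear(n_feat, 256), nn.ReLU())
        self.emb = nn.Linear(256, emb_dim)

    def forward(self, x):            # x: (batch, frames, n_feat) noisy features
        enhanced = self.se(x) * x    # apply the estimated mask
        w = self.vad(enhanced)       # (batch, frames, 1) speech posteriors
        h = self.enc(enhanced)       # frame-level speaker features
        # Posterior-weighted average pooling: non-speech frames contribute little.
        pooled = (w * h).sum(1) / w.sum(1).clamp(min=1e-6)
        return self.emb(pooled)      # utterance-level speaker embedding

model = UnifiedSV()
emb = model(torch.randn(2, 300, 64))   # two utterances, 300 frames each
print(emb.shape)                        # torch.Size([2, 128])
```

Because all three modules sit in one computation graph, a loss on the speaker embedding can backpropagate through the VAD and SE modules, which is what end-to-end joint training means here.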

Highlights

  • Speaker verification (SV) is the task of verifying that an input utterance is spoken by a claimed speaker

  • To improve the robustness of the SV model to long non-speech segments, we propose self-adaptive soft voice activity detection (SAS-VAD) [33], which combines soft VAD and self-adaptive VAD (see the sketch after this list)

  • We argue that this is because feature pyramid module (FPM)-based multi-scale aggregation (MSA), SAS-VAD, and masking-based speech enhancement (SE) improve the robustness of the proposed system to short speech segments and long non-speech segments in noisy and reverberant environments
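
As a rough illustration of the soft VAD idea behind SAS-VAD, the NumPy sketch below contrasts hard VAD, which discards frames whose speech posterior falls below a threshold, with soft weighting, where every frame contributes in proportion to its posterior. The posterior values and the 0.5 threshold are made up for illustration.

```python
# Hard VAD (binary frame selection) vs. the soft VAD idea: weight every
# frame by its speech posterior instead of discarding frames below a
# threshold. Feature values and posteriors are synthetic.
import numpy as np

frames = np.random.randn(8, 4)                        # 8 frames, 4-dim features
post = np.array([.05, .1, .9, .95, .8, .2, .85, .1])  # speech posteriors

# Hard VAD: keep only frames whose posterior clears a fixed threshold.
hard_avg = frames[post > 0.5].mean(axis=0)

# Soft VAD: posterior-weighted average; borderline frames are attenuated
# rather than lost, which is gentler when the VAD itself is uncertain.
soft_avg = (post[:, None] * frames).sum(axis=0) / post.sum()

print(hard_avg, soft_avg, sep="\n")
```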


Summary

INTRODUCTION

Speaker verification (SV) is the task of verifying that an input utterance is spoken by a claimed speaker. The i-vector system with a DNN-based acoustic model requires well-annotated training data, and introducing the additional acoustic model significantly increases the computational complexity. We consider one more requirement which has not been addressed in recent SV studies, despite its importance in real-world applications: SV systems should be robust to input audio containing long non-speech segments, especially in noisy and reverberant environments. Our previous work [33] shows the need for a robust VAD for SV in real-world environments, where the input audio contains long non-speech segments under noise and reverberation. In these adverse environments, an energy-based VAD produces unreliable speech frames, which degrades the performance of SV systems [34].
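
To make concrete why an energy-based VAD is unreliable in noise, the sketch below implements a simple peak-relative energy threshold (the frame length, threshold rule, and signals are illustrative, not from the paper). Additive noise raises the energy of silent frames toward the threshold, so non-speech frames get mislabeled as speech:

```python
# A minimal energy-based VAD: flag a frame as speech if its mean energy
# exceeds a fraction of the loudest frame's energy.
import numpy as np

def energy_vad(signal, frame_len=400, ratio=0.1):
    frames = signal[:len(signal) // frame_len * frame_len]
    frames = frames.reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > ratio * energy.max()

rng = np.random.default_rng(0)
# 0.25 s of silence followed by 0.25 s of "speech" (stand-in: unit noise).
speech = np.concatenate([np.zeros(4000), rng.normal(0, 1.0, 4000)])

clean_labels = energy_vad(speech)
noisy_labels = energy_vad(speech + rng.normal(0, 0.5, speech.size))

print(clean_labels.sum(), "speech frames (clean)")   # only the speech half
print(noisy_labels.sum(), "speech frames (noisy)")   # silence frames flagged too
```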

DEEP SPEAKER EMBEDDING LEARNING
SPEECH ENHANCEMENT
SELF-ADAPTIVE VAD
IMPLEMENTATION DETAILS
We extracted two types of acoustic features (a hedged feature-extraction sketch follows this outline).
Findings
CONCLUSION
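
The implementation details mention two types of acoustic features without naming them in this summary. The sketch below shows the two most common choices in speaker verification, log mel-filterbank energies and MFCCs, using librosa with typical parameter values; these are assumptions, not the paper's exact configuration.

```python
# Two common acoustic feature types for SV, computed with librosa.
# Parameters (16 kHz audio, 32 ms window, 10 ms hop, 40 mels, 20 MFCCs)
# are typical values, not taken from the paper.
import numpy as np
import librosa

sr = 16000
wav = np.random.randn(sr).astype(np.float32)   # 1 s of noise as stand-in audio

mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=40)
log_mel = np.log(mel + 1e-6)                   # log mel-filterbank energies

mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20,
                            n_fft=512, hop_length=160)

print(log_mel.shape, mfcc.shape)               # (40, 101) and (20, 101)
```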