Abstract

Speech Activity Detection (SAD) aims to accurately classify audio fragments containing human speech. Current state-of-the-art systems for the SAD task are mainly based on deep learning solutions. These applications usually show a significant drop in performance when test data are different from training data due to the domain shift observed. Furthermore, machine learning algorithms require large amounts of labelled data, which may be hard to obtain in real applications. Considering both ideas, in this paper we evaluate three unsupervised domain adaptation techniques applied to the SAD task. A baseline system is trained on a combination of data from different domains and then adapted to a new unseen domain, namely, data from Apollo space missions coming from the Fearless Steps Challenge. Experimental results demonstrate that domain adaptation techniques seeking to minimise the statistical distribution shift provide the most promising results. In particular, Deep CORAL method reports a 13% relative improvement in the original evaluation metric when compared to the unadapted baseline model. Further experiments show that the cascaded application of Deep CORAL and pseudo-labelling techniques can improve even more the results, yielding a significant 24% relative improvement in the evaluation metric when compared to the baseline system.

Highlights

  • Speech Activity Detection (SAD) aims to determine whether an audio signal contains speech or not, and its exact location in the signal

  • Inspired by our previous experiences participating in the Fearless Steps Challenge [23,24], that introduced a new audio domain in the research community, in this paper we aim to explore unsupervised domain adaptation techniques in the context of the SAD task

  • In this Figure, we present the detection error trade-off (DET) curve and equal error rate (EER) for some of the best performing knowledge distillation systems compared to the unadapted baseline system

Read more

Summary

Introduction

Speech Activity Detection (SAD) aims to determine whether an audio signal contains speech or not, and its exact location in the signal. This constitutes an essential preprocessing step in several speech-related applications such as speech and speaker recognition, as well as speech enhancement. SAD is used as a preliminary block to separate the segments of the signal that contain speech from those that are only noise. This way, enabling the overall system to process only the speech segments. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call