Supervised Contrastive Learning for Voice Activity Detection

Youngjun Heo, Sunggu Lee

doi:10.3390/electronics12030705

Open Access

https://doi.org/10.3390/electronics12030705

Journal: Electronics
Publication Date: Jan 31, 2023
Citations: 1
License type: CC BY 4.0

Affiliation: Pohang University of Science and Technology

Abstract

Noise robustness in voice activity detection (VAD), the task of identifying the human speech portions of a continuous audio signal, is important for downstream applications such as keyword spotting and automatic speech recognition. Although various aspects of VAD have recently been studied, training strategies for VAD have not received sufficient attention. This paper therefore proposes, for the first time, a training strategy for VAD based on supervised contrastive learning, used in conjunction with audio-specific data augmentation methods. The proposed supervised contrastive learning-based VAD (SCLVAD) method is trained on two common speech datasets and then evaluated on a third dataset. The experimental results show that the SCLVAD method is particularly effective at improving VAD performance in noisy environments. In clean environments, data augmentation improves VAD accuracy by 8.0 to 8.6%, but supervised contrastive learning provides no improvement. In noisy environments, by contrast, the SCLVAD method improves VAD accuracy by 2.9% and 4.6% for “speech with noise” and “speech with music”, respectively, with only a negligible increase in processing overhead during training.
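
The abstract names supervised contrastive learning without stating the loss. As a rough illustration only, the sketch below computes the generic supervised contrastive loss for a batch of frame embeddings labeled speech or non-speech. It is not the authors' implementation: the function name, the temperature value, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative sketch).

    embeddings: list of equal-length float vectors, one per audio frame.
    labels: list of ints, e.g. 1 = speech, 0 = non-speech (assumed encoding).
    temperature: assumed value, not taken from the paper.
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples in the batch that share the anchor's label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchor has no same-class partner in this batch

        # Temperature-scaled similarity between the anchor and every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n) if j != i
        }

        # Log-sum-exp over all non-anchor samples, stabilized by the maximum.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))

        # Average negative log-probability assigned to the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1

    return total / anchors if anchors else 0.0


# Toy batch: frames 0 and 1 point one way, frames 2 and 3 another.
frames = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
aligned = supervised_contrastive_loss(frames, [1, 1, 0, 0])
misaligned = supervised_contrastive_loss(frames, [1, 0, 1, 0])
print(aligned < misaligned)
```

The loss is lower when frames with the same label already lie close together in embedding space, which is the property a contrastive training stage tries to encourage before or alongside classification.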

Full Text