Abstract

The noise robustness of voice activity detection (VAD) tasks, which are used to identify the human speech portions of a continuous audio signal, is important for subsequent downstream applications such as keyword spotting and automatic speech recognition. Although various aspects of VAD have been recently studied by researchers, a proper training strategy for VAD has not received sufficient attention. Thus, a training strategy for VAD using supervised contrastive learning is proposed for the first time in this paper. The proposed method is used in conjunction with audio-specific data augmentation methods. The proposed supervised contrastive learning-based VAD (SCLVAD) method is trained using two common speech datasets and then evaluated using a third dataset. The experimental results show that the SCLVAD method is particularly effective in improving VAD performance in noisy environments. For clean environments, data augmentation improves VAD accuracy by 8.0 to 8.6%, but there is no improvement due to the use of supervised contrastive learning. On the other hand, for noisy environments, the SCLVAD method results in VAD accuracy improvements of 2.9% and 4.6% for “speech with noise” and “speech with music”, respectively, with only a negligible increase in processing overhead during training.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call