Abstract

Speech activity detection (SAD) serves as the front-end in most speech systems, such as speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications such as zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, collecting SAD annotations may not be feasible. SAD is also challenging for naturalistic audio streams containing multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. The FDK provides an enhanced spectral decomposition from which several statistical descriptors are derived. These FDK statistical descriptors are combined by principal component analysis into a one-dimensional FDK-SAD feature. We further propose two decision backends: (i) a variable model-size Gaussian mixture model (VMGMM) and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We use both backends in two phases of comparative evaluation: first, standalone SAD performance; and second, the effect of SAD on text-dependent speaker verification using the RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We also establish two Center for Robust Speech Systems (CRSS) corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise corpus, which facilitate standalone SAD evaluations on naturalistic audio streams. We perform comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, a semisupervised Gaussian mixture model, and Gammatone spectrogram features.
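To make the feature-and-backend pipeline summarized above more concrete, the following is a minimal sketch, not the authors' implementation: it assumes a toy array of per-frame statistical descriptors, projects them onto a single principal component (analogous to the one-dimensional FDK-SAD feature), and selects a Gaussian mixture model size by BIC, echoing the variable model-size idea behind VMGMM. All names, shapes, and data here are illustrative assumptions.

```python
# Illustrative sketch only: PCA reduction of frame-level statistical
# descriptors to a 1-D feature, followed by GMM size selection via BIC.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data: 1000 frames, 8 statistical descriptors per frame.
descriptors = rng.normal(size=(1000, 8))

# Combine the descriptors into a one-dimensional feature via PCA.
feature_1d = PCA(n_components=1).fit_transform(descriptors)

# Fit GMMs of increasing size and keep the one with the lowest BIC,
# mimicking a variable model-size selection step.
candidates = [GaussianMixture(n_components=k, random_state=0).fit(feature_1d)
              for k in range(1, 5)]
best = min(candidates, key=lambda g: g.bic(feature_1d))

# Assign each frame to a mixture component; in a real SAD system the
# component with the larger mean would be interpreted as speech.
labels = best.predict(feature_1d)
print(f"selected {best.n_components} components")
```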
