Abstract

In spoken language processing applications such as speaker recognition and verification, silence segments not only contribute no speaker-specific information but also dilute the information already present in the speech segments of the audio. Experimental studies have shown that removing silence segments from an utterance with a voice activity detector (VAD) before feature extraction improves the performance of speaker recognition systems. Empirical algorithms based on signal energy and spectral centroid (ESC) are among the most popular approaches to VAD. In this paper, we show that a VAD that uses spectral matching (SM) to distinguish silence from speech segments outperforms an ESC-based VAD. For improved performance, we use a neural network whose input is the TempoRAl PatternS (TRAPS) of critical band energies. We evaluate the VADs using a speaker recognition system developed for 20 speakers.
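To make the ESC baseline concrete, the following is a minimal sketch of an energy and spectral centroid VAD, not the implementation evaluated in the paper. The function names (`frame_signal`, `esc_vad`), the 25 ms frame and 10 ms hop, and both thresholds (`energy_ratio`, `centroid_hz`) are illustrative assumptions; the abstract does not specify how the thresholds are chosen.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def esc_vad(x, sr, frame_ms=25, hop_ms=10, energy_ratio=0.1, centroid_hz=500.0):
    """Return one boolean per frame: True where the frame is labelled speech.

    Assumed decision rule: a frame is speech when its energy exceeds a
    fraction of the mean frame energy AND its spectral centroid exceeds
    a fixed frequency. Both thresholds are hypothetical placeholders.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    frames = frames * np.hamming(frame_len)

    # Short-time energy of each windowed frame.
    energy = np.sum(frames ** 2, axis=1)

    # Spectral centroid: magnitude-weighted mean frequency of each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroid = (spectrum @ freqs) / (np.sum(spectrum, axis=1) + 1e-12)

    return (energy > energy_ratio * energy.mean()) & (centroid > centroid_hz)
```

Frames flagged `False` would be dropped before feature extraction. Because both tests rely on fixed, hand-set thresholds, a sketch like this is sensitive to recording level and background noise, which is the kind of limitation a learned detector is meant to address.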

