Abstract

Mismatches in pitch, speaking rate, formant dispersion and ambient noise degrade the performance of an automatic keyword spotting (KWS) system. The work presented in this paper aims to reduce these mismatches through front-end signal processing. In the proposed approach, the short-term magnitude spectra (ST-MS) are first computed with a smaller frameshift and then averaged over adjacent frames to enhance the formant regions. The formant regions in the ST-MS have a higher magnitude than the neighbouring frequency regions. Consequently, temporal averaging of the ST-MS over adjacent frames suppresses the high-frequency variation due to pitch and formant dispersion. Furthermore, the formant peaks can be detected more accurately from the temporally averaged ST-MS than from the original ST-MS. The Mel frequency cepstral coefficients (MFCC) computed from the temporally averaged magnitude spectra (TAS-MFCC) are more robust to pitch than the standard MFCC and than MFCC extracted with previously reported spectral smoothing approaches employing variational mode decomposition (VMD-MFCC), pitch adaptive cepstral truncation (PACT-MFCC) and a single-pole filter (SPS-MFCC). The performance of the TAS-MFCC feature in mismatched test conditions is further improved by appending five logarithmically compressed resonant peaks separated by at least 400 Hz; this feature is termed TAS-MFCC-ARP. The spectral peaks mostly represent the formants in the ST-MS. Results for the deep neural network-hidden Markov model-based children’s KWS system reported in this work show that TAS-MFCC-ARP provides a relative performance improvement of 103.83% compared to MFCC. The performance of the KWS system is further improved by data-augmented training through duration modification.
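The front-end steps described above (ST-MS with a small frameshift, averaging over adjacent frames, then picking up to five peaks at least 400 Hz apart) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 20 ms Hamming window, the 2 ms frameshift, the averaging context of two frames on each side, the FFT size, and the greedy largest-first peak selection are all assumptions made for illustration. The abstract also does not state whether the peak frequencies or the peak magnitudes are log-compressed, so the sketch returns both the frequencies and the log magnitudes.

```python
import numpy as np


def short_term_magnitude_spectra(signal, sr, frame_len_ms=20.0,
                                 frame_shift_ms=2.0, n_fft=512):
    """Frame the signal and return |FFT| per frame (assumed parameters)."""
    frame_len = int(sr * frame_len_ms / 1000)
    shift = int(sr * frame_shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift:i * shift + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


def temporal_average(spec, context=2):
    """Average each frame with `context` neighbours on either side."""
    width = 2 * context + 1
    kernel = np.ones(width) / width
    # Repeat the edge frames so the output keeps the same number of frames.
    padded = np.pad(spec, ((context, context), (0, 0)), mode="edge")
    return np.stack([np.convolve(padded[:, k], kernel, mode="valid")
                     for k in range(spec.shape[1])], axis=1)


def pick_peaks(frame_spec, sr, n_fft, n_peaks=5, min_sep_hz=400.0):
    """Greedy selection (an assumption): take the largest local maxima
    first, skipping any candidate closer than min_sep_hz to a chosen one."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    interior = np.arange(1, len(frame_spec) - 1)
    is_peak = ((frame_spec[1:-1] > frame_spec[:-2]) &
               (frame_spec[1:-1] >= frame_spec[2:]))
    candidates = interior[is_peak]
    by_magnitude = candidates[np.argsort(frame_spec[candidates])[::-1]]
    chosen = []
    for idx in by_magnitude:
        if all(abs(freqs[idx] - freqs[j]) >= min_sep_hz for j in chosen):
            chosen.append(idx)
        if len(chosen) == n_peaks:
            break
    chosen = sorted(chosen)
    # Small offset avoids log(0) for silent bins.
    return freqs[chosen], np.log(frame_spec[chosen] + 1e-10)
```

In a full pipeline the averaged spectra would then pass through the usual Mel filterbank, logarithm and DCT to give the TAS-MFCC, with the peak values appended per frame; those stages are omitted here because the abstract does not specify their settings.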
