Abstract

The rapid development of deep learning has significantly improved the performance of keyword spotting, but it has also increased the demand for labeled data. This paper proposes a neural-network-based end-to-end (E2E) model for keyword spotting that incorporates an attention mechanism and requires only a small amount of supervised data. The proposed attention-based E2E keyword spotting model consists of three main modules: a keyword embedding module, an acoustic module, and a keyword spotting module. The keyword embedding module produces an embedding vector for the keyword. The acoustic module uses an attention mechanism to combine the keyword embedding vector with the audio features and obtain a corresponding feature vector. The keyword spotting module takes this feature vector as input and detects whether the keyword occurs in the audio. We run experiments with 13 and 20 different keywords on the AISHELL-2 dataset. The results show false alarm rates of 0.17 and 0.11 per hour and false rejection rates of 3.98% and 3.36%, respectively.
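
The three-module pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the attention form, layer types, or dimensions, so scaled dot-product attention, a single logistic output unit, the random weights, and every function name below are assumptions made purely for illustration. The two helper functions show how the reported metrics are conventionally computed (false alarms divided by hours of audio, and missed keywords divided by keyword utterances).

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attend(keyword_vec, frames):
    """Acoustic module (assumed form): the keyword embedding queries the audio frames.

    keyword_vec: shape (d,), output of a hypothetical keyword embedding module.
    frames:      shape (T, d), one audio feature vector per frame.
    Returns the attention-weighted feature vector (d,) and the weights (T,).
    """
    scores = frames @ keyword_vec / np.sqrt(keyword_vec.shape[0])
    weights = softmax(scores)
    context = weights @ frames
    return context, weights


def detect(context, w, b):
    """Keyword spotting module (assumed form): logistic score that the keyword occurs."""
    logit = float(context @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))


def false_alarms_per_hour(num_false_alarms, audio_hours):
    """False alarm rate: false detections per hour of audio."""
    return num_false_alarms / audio_hours


def false_rejection_rate(num_missed, num_keyword_utterances):
    """False rejection rate: fraction of keyword utterances that were missed."""
    return num_missed / num_keyword_utterances


# Toy run with random, untrained values (hypothetical sizes d=8, T=50).
rng = np.random.default_rng(0)
d, T = 8, 50
keyword_vec = rng.normal(size=d)
frames = rng.normal(size=(T, d))
context, weights = attend(keyword_vec, frames)
prob = detect(context, rng.normal(size=d), 0.0)
```

In a trained system the embedding, attention, and detection parameters would be learned jointly; the sketch only shows how data would flow between the three modules.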

