Abstract

Keyword spotting plays a crucial role in realizing voice-based user interaction on intelligent equipment terminals and service robots. In this task, it remains challenging to balance low memory usage against high precision. To better satisfy this requirement, we propose an end-to-end neural architecture built from sandglass residual blocks embedded with a gated channel-wise attention mechanism. The sandglass residual blocks use 1D separable convolutions to extract bottleneck temporal features, which effectively drives the model to focus on the speech segment while using fewer parameters. In particular, the gated attention mechanism helps the model enhance the critical temporal speech features, suppress the useless ones, and further focus on the part of the human speech region that matters most for keyword spotting. Experimental results on the Google Speech Commands Dataset show that our proposed model reaches an accuracy of 97.4% with only 46K parameters. Compared with the baseline method with the highest accuracy, our model has 54% fewer parameters and 0.8% higher accuracy. This brings us a step closer to the goal of low memory and high precision.
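
To illustrate why 1D separable convolutions reduce the parameter count, the following minimal sketch compares the number of weights in a standard 1D convolution with that of a depthwise separable one (a depthwise convolution followed by a pointwise 1x1 convolution). The channel width of 64 and kernel size of 9 are hypothetical values chosen for illustration only; they are not taken from the paper, and the function names are our own.

```python
def conv1d_params(c_in, c_out, kernel_size, bias=False):
    """Number of weights in a standard 1D convolution."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def separable_conv1d_params(c_in, c_out, kernel_size, bias=False):
    """Number of weights in a depthwise separable 1D convolution."""
    # Depthwise step: one kernel of length kernel_size per input channel.
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    # Pointwise step: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer sizes (assumptions, not the paper's configuration).
C_IN, C_OUT, KERNEL = 64, 64, 9

standard = conv1d_params(C_IN, C_OUT, KERNEL)
separable = separable_conv1d_params(C_IN, C_OUT, KERNEL)

print(f"standard:  {standard} parameters")   # 36864
print(f"separable: {separable} parameters")  # 4672
print(f"reduction: {standard / separable:.1f}x")
```

Under these assumed sizes, the separable layer needs roughly an eighth of the weights of the standard layer, which is the kind of saving that makes a small parameter budget feasible.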
