Abstract

Keyword spotting (KWS) is a major component of human-computer interaction for smart on-device terminals and service robots, where the goal is to maximize detection accuracy while keeping the model footprint small. In this paper, building on DenseNet's strong ability to extract local feature maps, we propose a new network architecture (DenseNet-BiLSTM) for KWS. In our DenseNet-BiLSTM, the DenseNet is primarily applied to obtain local features, while the BiLSTM is used to capture time-series features. DenseNet is generally used in computer vision tasks, and it may corrupt contextual information when applied to speech audio. In order to make DenseNet suitable for KWS, we propose a variant, called DenseNet-Speech, which removes pooling along the time dimension in the transition layers to preserve the temporal structure of speech. In addition, DenseNet-Speech uses fewer dense blocks and filters to keep the model small, thereby reducing inference time on mobile devices. The experimental results show that feature maps from DenseNet-Speech preserve time-series information well. Our method outperforms state-of-the-art methods in terms of accuracy on the Google Speech Commands dataset. DenseNet-BiLSTM achieves an accuracy of 96.6% on the 20-command recognition task with 223K trainable parameters.
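
To illustrate the architectural idea described above, the following is a minimal sketch (not the authors' exact configuration) of a DenseNet-Speech-style transition layer that pools only along the frequency axis, followed by a BiLSTM head. It is written with tf.keras; the input shape, filter counts, LSTM width, and helper name `transition_speech` are assumptions for illustration only.

```python
# Hypothetical sketch: transition layer without time pooling + BiLSTM head.
# Layer sizes and names are illustrative, not the paper's exact settings.
import tensorflow as tf
from tensorflow.keras import layers

def transition_speech(x, compression=0.5):
    """Transition layer that pools only along the frequency axis,
    leaving the time axis intact so the BiLSTM still sees every frame."""
    channels = int(x.shape[-1] * compression)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(channels, kernel_size=1, use_bias=False)(x)
    # pool_size=(1, 2): no pooling on the time dimension, halve the frequency dimension
    return layers.AveragePooling2D(pool_size=(1, 2))(x)

# Input: (time_frames, mel_bins, 1) log-mel spectrogram; shape is an assumption.
inputs = tf.keras.Input(shape=(98, 40, 1))
x = layers.Conv2D(32, kernel_size=3, padding="same")(inputs)
# ... a small number of dense blocks would go here to keep the footprint low ...
x = transition_speech(x)
# Collapse the frequency axis so each time step becomes one feature vector for the BiLSTM.
t, f, c = x.shape[1], x.shape[2], x.shape[3]
x = layers.Reshape((t, f * c))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(20, activation="softmax")(x)  # 20-command task
model = tf.keras.Model(inputs, outputs)
```

Because the time dimension is never downsampled in the transition layers, the feature maps handed to the BiLSTM retain one vector per input frame, which is what allows the recurrent layer to model the full temporal context of the utterance.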
