Abstract

Keyword spotting (KWS) is a major component of human-computer interaction for smart on-device terminals and service robots, where the goal is to maximize detection accuracy while keeping the model footprint small. In this paper, building on DenseNet's strong ability to extract local feature maps, we propose a new network architecture (DenseNet-BiLSTM) for KWS. In our DenseNet-BiLSTM, the DenseNet is primarily applied to obtain local features, while the BiLSTM is used to capture time-series features. DenseNet is generally used in computer vision tasks, and it may corrupt contextual information when applied to speech audio. In order to make DenseNet suitable for KWS, we propose a variant, called DenseNet-Speech, which removes pooling along the time dimension in the transition layers to preserve the temporal structure of speech. In addition, DenseNet-Speech uses fewer dense blocks and filters to keep the model small, thereby reducing inference time on mobile devices. The experimental results show that feature maps from DenseNet-Speech preserve time-series information well. Our method outperforms state-of-the-art methods in terms of accuracy on the Google Speech Commands dataset. DenseNet-BiLSTM achieves an accuracy of 96.6% on the 20-command recognition task with 223K trainable parameters.
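
To illustrate the architectural idea described above, the following is a minimal sketch (not the authors' exact configuration) of a DenseNet-Speech-style transition layer that pools only along the frequency axis, followed by a BiLSTM head. It is written with tf.keras; the input shape, filter counts, LSTM width, and helper name `transition_speech` are assumptions for illustration only.

```python
# Hypothetical sketch: transition layer without time pooling + BiLSTM head.
# Layer sizes and names are illustrative, not the paper's exact settings.
import tensorflow as tf
from tensorflow.keras import layers

def transition_speech(x, compression=0.5):
    """Transition layer that pools only along the frequency axis,
    leaving the time axis intact so the BiLSTM still sees every frame."""
    channels = int(x.shape[-1] * compression)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(channels, kernel_size=1, use_bias=False)(x)
    # pool_size=(1, 2): no pooling on the time dimension, halve the frequency dimension
    return layers.AveragePooling2D(pool_size=(1, 2))(x)

# Input: (time_frames, mel_bins, 1) log-mel spectrogram; shape is an assumption.
inputs = tf.keras.Input(shape=(98, 40, 1))
x = layers.Conv2D(32, kernel_size=3, padding="same")(inputs)
# ... a small number of dense blocks would go here to keep the footprint low ...
x = transition_speech(x)
# Collapse the frequency axis so each time step becomes one feature vector for the BiLSTM.
t, f, c = x.shape[1], x.shape[2], x.shape[3]
x = layers.Reshape((t, f * c))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(20, activation="softmax")(x)  # 20-command task
model = tf.keras.Model(inputs, outputs)
```

Because the time dimension is never downsampled in the transition layers, the feature maps handed to the BiLSTM retain one vector per input frame, which is what allows the recurrent layer to model the full temporal context of the utterance.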
