The keyword spotting (KWS) system is one of the most important interfaces between humans and machines, since it is usually the entry point to automatic speech recognition and natural language processing. For KWS hardware, however, it remains difficult to make a single chip both low-power and high-performing across multiple scenarios, such as meeting rooms, traffic, or parks, because different scenarios span a wide range of signal-to-noise ratios (SNRs). This calls for a design that balances KWS system accuracy against hardware cost under various noise types and levels. To address this tradeoff, this paper proposes a complete KWS processor for wide-SNR-range, low-power KWS, comprising a Mel-frequency cepstral coefficient (MFCC) feature extractor and a quantized convolutional neural network (QCNN) accelerator. First, an approach to quantizing CNNs into QCNNs with high accuracy is proposed, taking the hardware-software tradeoff into account. Weighing KWS system accuracy against hardware cost, a 4-bit/8-bit dual-working-mode strategy is proposed to keep hardware cost low and accuracy high under different scenarios. Specifically, the CNNs and QCNNs are trained, tuned, and validated on a dataset of 10 keywords chosen from the Google Command Speech Dataset (GCSD). Second, a serial-FFT-based MFCC extractor is implemented with low power and a small footprint. Finally, a reconfigurable QCNN accelerator based on approximate computing is designed, using a novel hybrid reuse strategy for input data and network weights.
Implemented and verified in TSMC 22 nm ULL technology with an area of 1.42 mm², the QCNN accelerator consumes 5.26 μW in 4-bit mode and 9.08 μW in 8-bit mode, with accuracies of 88% and 93%, respectively, which is superior to state-of-the-art processors.