Abstract

This article reports a passive analog feature extractor for realizing an area- and power-efficient voice activity detector (VAD) for voice-controlled edge devices. It features a switched-capacitor circuit as the time-domain convolutional neural network (TD-CNN) that extracts 1-bit features for the subsequent binarized neural network (BNN) classifier. The TD-CNN also enables area savings and low latency by evaluating the features temporally. The applied sparsity-aware computation (SAC) and sparsified quantization (SQ) enlarge the output swing and reduce the model size without sacrificing classification accuracy. With these techniques, the diversified output also desensitizes the 1-bit quantizer to offset and noise. The TD-CNN and BNN are trained as a single network to improve the VAD's reconfigurability. Benchmarked against the prior art, our VAD in 28-nm CMOS scores a 90% (94%) speech (non-speech) hit rate on the TIMIT dataset with low power (108 nW) and small area (0.8 mm²). The TD-CNN can also be configured as a feature extractor for keyword spotting (KWS), achieving 93.5% KWS accuracy on the Google speech command dataset (two keywords). With two TD-CNNs operating simultaneously to extract more features, the KWS accuracy rises to 94.3%.
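The 1-bit feature/BNN pipeline the abstract describes can be illustrated in software. The sketch below is a minimal, hypothetical model of the signal flow only: real-valued "analog" features are binarized by a 1-bit quantizer and classified by a small binarized network whose arithmetic reduces to integer accumulation (equivalent to XNOR-popcount in hardware). All shapes, weights, and data here are illustrative assumptions, not the paper's actual circuit or model.

```python
# Hedged sketch of a 1-bit-feature -> BNN classifier signal flow.
# Everything here (dimensions, random weights, toy input) is an
# illustrative assumption, not the reported design.
import numpy as np

rng = np.random.default_rng(0)

def binarize(x):
    """1-bit quantizer: map real values to {-1, +1}."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def bnn_layer(x_bin, w_bin):
    """Binarized dense layer: with inputs and weights in {-1, +1},
    the dot product equals a rescaled popcount(XNOR); computed
    directly here as an integer matrix product."""
    return x_bin @ w_bin.T

# Toy stand-in for the TD-CNN's analog feature outputs (16 channels).
analog_features = rng.standard_normal(16)
x = binarize(analog_features)                  # 1-bit features

W1 = binarize(rng.standard_normal((8, 16)))    # binarized hidden weights
W2 = binarize(rng.standard_normal((2, 8)))     # 2 outputs: speech / non-speech

h = binarize(bnn_layer(x, W1))                 # hidden activations re-binarized
logits = bnn_layer(h, W2)
print("speech" if logits[0] > logits[1] else "non-speech")
```

Because every operand after quantization is a single bit, such a classifier needs no multipliers, which is the property that makes the analog front end plus BNN combination attractive at nanowatt power budgets.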
