Abstract

A keyword spotting algorithm implemented on an embedded system using a depthwise separable convolutional neural network classifier is reported. The proposed system was derived from a high-complexity system with the goal of reducing complexity and increasing efficiency. To meet the requirements set by hardware resource constraints, a limited hyper-parameter grid search was performed, which showed that network complexity could be drastically reduced with little effect on classification accuracy. It was furthermore found that quantization of pre-trained networks using mixed and dynamic fixed-point principles could reduce the memory footprint and computational requirements without lowering classification accuracy. Data augmentation techniques were used to increase network robustness in unseen acoustic conditions by mixing training data with realistic noise recordings. Finally, the system’s ability to detect keywords in a continuous audio stream was successfully demonstrated.
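The paper reports no source code; as a point of reference, the sketch below shows the depthwise separable convolution block that such a classifier is typically built from, written in PyTorch purely for illustration (class and parameter names are assumptions, not the authors' implementation).

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One DS-CNN block: a depthwise convolution (one spatial filter per
    input channel) followed by a 1x1 pointwise convolution that mixes
    channels. Factorizing a standard convolution this way is what cuts
    parameters and multiply operations."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise stage: groups=in_channels gives each channel its own filter.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise stage: a 1x1 convolution recombines channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```

The factorization is where the complexity reduction comes from: a standard K x K convolution mapping C_in to C_out channels costs K^2 * C_in * C_out multiplications per output position, whereas the depthwise-plus-pointwise pair costs K^2 * C_in + C_in * C_out, roughly a K^2-fold saving when C_out is large.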

Highlights

  • During the last decade, deep learning algorithms have continuously improved performance in a wide range of applications, including automatic speech recognition (ASR) [1]

  • This paper extends previous efforts [17] to implement a keyword spotting (KWS) system based on a depthwise separable convolutional neural network (DS-CNN) by (a) identifying performance-critical elements of the system when scaling the network complexity and (b) augmenting the training data with a wider variety of realistic noise recordings, using a controlled range of signal-to-noise ratios (SNRs) representative of practical KWS applications during both training and testing (a noise-mixing sketch follows this list)

  • For SNRs below 20 dB, the network trained on noisy data achieved a higher test accuracy than the network trained on clean data, while its accuracy was slightly lower for SNRs above 20 dB
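As a concrete illustration of the augmentation step mentioned above, the sketch below mixes a keyword clip with a segment of a noise recording scaled to a target SNR. The procedure, signal lengths, and SNR range are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float,
               rng: np.random.Generator) -> np.ndarray:
    """Add a random segment of `noise` to `speech`, scaled so that the
    speech-to-noise power ratio equals `snr_db` (in dB)."""
    start = rng.integers(0, len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)].astype(np.float64)

    speech_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(segment ** 2) + 1e-12  # guard against silent segments

    # Solve 10*log10(speech_power / (scale**2 * noise_power)) = snr_db for scale.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * segment

# Example: augment one clip at an SNR drawn from an illustrative 0-20 dB range.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a 1 s keyword clip at 16 kHz
noise = rng.standard_normal(160000)   # stand-in for a longer noise recording
augmented = mix_at_snr(speech, noise, rng.uniform(0.0, 20.0), rng)
```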

Introduction

Deep learning algorithms have continuously improved performance in a wide range of applications, including automatic speech recognition (ASR) [1]. Artificial intelligence (AI) digital assistants use natural speech as the primary user interface and often require access to cloud computation for the demanding processing tasks. Such cloud-based solutions are impractical for many devices and raise user concerns, both because continuous internet access is required and because audio is transmitted continuously to the cloud, with the privacy issues that entails [2]. In contrast to these large-vocabulary ASR systems, devices with more limited functionality could be controlled more efficiently using a small vocabulary of spoken keywords detected on the device itself, i.e., keyword spotting (KWS).

