Abstract
Driven by the vision of the Internet of Things, some research efforts have already focused on designing efficient speech recognition networks for edge computing. Other approaches (such as tpool2) do not make full use of the spatial and temporal information in the acoustic features of speech. In this paper, we propose a compact speech recognition network with spatio-temporal features for edge computing, named EdgeRNN. EdgeRNN uses a 1-Dimensional Convolutional Neural Network (1-D CNN) to process the overall spatial information of each frequency domain of the acoustic features, and a Recurrent Neural Network (RNN) to process the temporal information of each frequency domain. In addition, we propose a simplified attention mechanism to enhance the portions of the network's input that contribute most to the final recognition. The overall performance of EdgeRNN has been verified on speech emotion and keyword recognition. Speech emotion recognition uses the IEMOCAP dataset, reaching an unweighted average recall (UAR) of 63.98%. Speech keyword recognition uses Google's Speech Commands Dataset V1, reaching a weighted average recall (WAR) of 96.82%. Compared with the experimental results of related efficient networks on a Raspberry Pi 3B+, EdgeRNN improves accuracy on both speech emotion and keyword recognition.
Highlights
According to the IHS Markit perspective [1], the number of Internet of Things (IoT) devices is expected to reach 125 billion by 2030. These IoT devices have attracted much attention in industry and academia because they can be widely used in many applications [2]. Because of their constrained resources [3], such micro-instruments are commonly called edge computing devices.
To solve the performance and accuracy problems of speech recognition on edge computing devices, we propose a compact Recurrent Neural Network (RNN) named EdgeRNN.
EdgeRNN consists of a 1-Dimensional Convolutional Neural Network (1-D CNN), an RNN and an attention mechanism, which is a very common network structure for speech recognition.
Summary
According to the IHS Markit perspective [1], the number of Internet of Things (IoT) devices is expected to reach 125 billion by 2030. A combination of 1-D CNN and RNN is required to design a speech recognition network model for edge computing devices. To solve the performance and accuracy problems of speech recognition on edge computing devices, we propose a compact RNN named EdgeRNN. EdgeRNN consists of a 1-D CNN, an RNN and an attention mechanism, which is a very common network structure for speech recognition. 1) It is the first such network applied to speech recognition tasks on edge computing devices, mainly because of the low computation and parameter counts of the 1-D CNN, RNN and attention mechanism. 2) The EdgeRNN model, running on the Raspberry Pi 3B+, can recognize and process speech roughly twice as fast as the time taken to collect it. This performance meets the practical requirements of speech recognition for edge computing.
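To make the 1-D CNN + RNN + attention pipeline concrete, the following is a minimal NumPy sketch of that structure. All dimensions, weight shapes, and function names here are illustrative assumptions, not the paper's actual EdgeRNN configuration: a 1-D convolution over the frame sequence captures per-frame frequency (spatial) patterns, a simple RNN models the temporal dynamics, and a simplified attention step pools the hidden states into one context vector for classification.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid 1-D convolution along time. x: (T, F_in), w: (K, F_in, F_out)."""
    T = x.shape[0]
    K, _, F_out = w.shape
    out = np.stack([np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1])) + b
                    for t in range(T - K + 1)])
    return np.maximum(out, 0.0)  # ReLU

def simple_rnn(x, wx, wh, b):
    """Elman-style RNN. x: (T, C) -> hidden states (T, H)."""
    h = np.zeros(wh.shape[0])
    states = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh + b)
        states.append(h)
    return np.stack(states)

def attention_pool(h, v):
    """Simplified attention: one scalar score per frame, softmax over time."""
    scores = h @ v
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ h  # weighted sum of hidden states -> (H,)

# Toy dimensions (hypothetical): 40 frames of 26 filterbank features,
# 32 conv channels, 64 hidden units, 4 output classes.
T, F, C, H, classes = 40, 26, 32, 64, 4
x = rng.standard_normal((T, F))
cw, cb = rng.standard_normal((5, F, C)) * 0.1, np.zeros(C)
wx, wh, rb = rng.standard_normal((C, H)) * 0.1, rng.standard_normal((H, H)) * 0.1, np.zeros(H)
v = rng.standard_normal(H)
wo = rng.standard_normal((H, classes)) * 0.1

feat = conv1d(x, cw, cb)               # spatial (frequency) patterns per frame
states = simple_rnn(feat, wx, wh, rb)  # temporal information across frames
context = attention_pool(states, v)    # emphasise the most informative frames
logits = context @ wo                  # class scores, shape (classes,)
```

In a deployed model the weights would of course be learned rather than random; the sketch only shows how the three stages compose, which is why such a stack stays cheap enough in parameters and computation for devices like the Raspberry Pi 3B+.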