Abstract

Driven by the vision of the Internet of Things, some research efforts have already focused on designing efficient speech recognition networks for edge computing. However, existing approaches (such as tpool2) do not make full use of the spatial and temporal information in the acoustic features of speech. In this paper, we propose a compact speech recognition network with spatio-temporal features for edge computing, named EdgeRNN. Specifically, EdgeRNN uses a 1-Dimensional Convolutional Neural Network (1-D CNN) to process the overall spatial information of each frequency domain of the acoustic features, and a Recurrent Neural Network (RNN) to process the temporal information of each frequency domain. In addition, we propose a simplified attention mechanism that enhances the portions of the network that contribute most to the final recognition. The overall performance of EdgeRNN has been verified on speech emotion recognition and speech keyword recognition. Speech emotion recognition uses the IEMOCAP dataset and reaches an unweighted average recall (UAR) of 63.98%. Speech keyword recognition uses Google's Speech Commands Dataset V1 and reaches a weighted average recall (WAR) of 96.82%. Compared with the experimental results of related efficient networks on a Raspberry Pi 3B+, EdgeRNN improves accuracy on both speech emotion and keyword recognition.
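
As a rough illustration of the pipeline the abstract describes, the following PyTorch sketch chains a 1-D CNN over the per-frame spectral content of the acoustic features, a GRU over the resulting frame sequence, and a simple attention pooling before the classifier. The layer sizes, feature dimensions, and exact attention formulation are illustrative assumptions, not the paper's published configuration.

    import torch
    import torch.nn as nn

    class EdgeRNNSketch(nn.Module):
        def __init__(self, n_feats=40, cnn_channels=64, rnn_hidden=128, n_classes=10):
            super().__init__()
            # 1-D convolution along time: each output channel mixes all frequency
            # bins of a frame (plus a small temporal context), i.e. the "spatial"
            # information of the acoustic features.
            self.cnn = nn.Sequential(
                nn.Conv1d(n_feats, cnn_channels, kernel_size=5, padding=2),
                nn.BatchNorm1d(cnn_channels),
                nn.ReLU(),
            )
            # The RNN models how the per-frame representations evolve over time.
            self.rnn = nn.GRU(cnn_channels, rnn_hidden, batch_first=True)
            # Simplified attention: one learned score per frame, softmax-normalised,
            # used to weight the RNN outputs before classification.
            self.attn = nn.Linear(rnn_hidden, 1)
            self.classifier = nn.Linear(rnn_hidden, n_classes)

        def forward(self, x):                 # x: (batch, n_feats, time)
            h = self.cnn(x)                   # (batch, cnn_channels, time)
            h = h.transpose(1, 2)             # (batch, time, cnn_channels)
            h, _ = self.rnn(h)                # (batch, time, rnn_hidden)
            weights = torch.softmax(self.attn(h), dim=1)   # (batch, time, 1)
            context = (weights * h).sum(dim=1)             # (batch, rnn_hidden)
            return self.classifier(context)                # (batch, n_classes)

    # Example: a batch of two clips with 40 feature bins and 100 frames each.
    logits = EdgeRNNSketch()(torch.randn(2, 40, 100))
    print(logits.shape)   # torch.Size([2, 10])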

Highlights

  • According to the IHS Markit perspective [1], the number of Internet of Things (IoT) devices is expected to reach 125 billion by 2030. These IoT devices have attracted much attention in industry and academia because they can be widely used in many applications [2]. Because of their constrained resources [3], such micro-devices are commonly referred to as edge computing devices.

  • To solve the performance and accuracy problems of speech recognition on edge computing devices, we propose a compact Recurrent Neural Network (RNN) named EdgeRNN.

  • EdgeRNN consists of a 1-Dimensional Convolutional Neural Network (1-D CNN), an RNN and an attention mechanism, which is a very common network structure for speech recognition; a sketch of a typical acoustic front end for this kind of network follows this list.
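
The abstract and highlights assume acoustic features with a frequency axis and a time axis as the network input. The paper's exact front end is not restated here; the sketch below assumes log-mel features computed with torchaudio, using hypothetical window, hop and mel-bin settings.

    import torch
    import torchaudio

    sample_rate = 16000
    waveform = torch.randn(1, sample_rate)        # placeholder 1-second clip

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,            # 25 ms window at 16 kHz (assumed)
        hop_length=160,       # 10 ms hop (assumed)
        n_mels=40,            # 40 frequency bins (assumed)
    )
    to_db = torchaudio.transforms.AmplitudeToDB()

    features = to_db(mel(waveform))               # shape: (1, n_mels, time)
    print(features.shape)                         # e.g. torch.Size([1, 40, 101])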

Summary

INTRODUCTION

According to the IHS Markit perspective [1], the number of Internet of Things (IoT) devices is expected to reach 125 billion by 2030. Designing a speech recognition network model for edge computing devices calls for a combination of a 1-D CNN and an RNN. To solve the performance and accuracy problems of speech recognition on edge computing devices, we propose a compact RNN named EdgeRNN. The contributions are twofold. 1) Although EdgeRNN consists of a 1-D CNN, an RNN and an attention mechanism, which is a very common network structure for speech recognition, it is the first such network applied to speech recognition tasks on edge computing devices; this is mainly because of the modest computation and parameter requirements of the 1-D CNN, RNN and attention mechanism. 2) The EdgeRNN model running on the Raspberry Pi 3B+ can recognize and process speech faster than the time taken to collect it; this performance meets the practical requirements of speech recognition for edge computing.
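
The real-time claim can be made concrete by comparing inference time with the duration of the audio being processed (the real-time factor). The snippet below is a hypothetical benchmark harness with a stand-in recurrent classifier; it is not the paper's measurement setup or the actual EdgeRNN model.

    import time
    import torch
    import torch.nn as nn

    class TinyRecognizer(nn.Module):
        """Stand-in recurrent classifier; EdgeRNN itself is not reproduced here."""
        def __init__(self, n_feats=40, hidden=64, n_classes=10):
            super().__init__()
            self.rnn = nn.GRU(n_feats, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, x):              # x: (batch, time, n_feats)
            out, _ = self.rnn(x)
            return self.fc(out[:, -1])     # logits from the last time step

    clip_seconds = 1.0                      # assumed clip length
    hop_seconds = 0.010                     # assumed 10 ms hop between frames
    frames = int(clip_seconds / hop_seconds)
    features = torch.randn(1, frames, 40)   # fake log-mel features for one clip

    model = TinyRecognizer().eval()
    with torch.no_grad():
        model(features)                     # warm-up run
        start = time.perf_counter()
        model(features)
        elapsed = time.perf_counter() - start

    # A real-time factor below 1 means the device keeps up with incoming audio.
    print(f"real-time factor: {elapsed / clip_seconds:.3f}")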

RELATED WORK
DESIGN OF EdgeRNN
The EdgeRNN model is divided into the following parts:
TIME INFORMATION EXTRACTION LAYER
SELF-ATTENTION MECHANISM LAYER AND CLASSIFICATION LAYER
Findings
CONCLUSION