Automatic Speech Recognition in Different Languages Using High-Density Surface Electromyography Sensors

Mingxing Zhu, Guoru Zhao, Cheng Wang, Shixiong Chen, Haoshi Zhang, Xiaochen Wang, Xin Wang, Zhen Huang, Guanglin Li

doi:10.1109/jsen.2020.3037061

Abstract

Automatic speech recognition (ASR) based on surface electromyography (sEMG) sensors is an important technology that converts electrical signals into computer-readable text and can overcome the limitation of acoustic sensors, which are easily contaminated by environmental noise. However, sEMG sensor placement currently depends mainly on the experimenter’s experience, which could miss important information about the major muscular activities and degrade classification performance. In this study, 120 closely spaced sEMG sensors were used to collect high-density sEMG (HD sEMG) signals for recognizing ten digits in English and Chinese. A linear discriminant analysis classifier was used to classify the speaking tasks, and the sequential forward selection algorithm was used to analyze the optimal positions of the sensors. The results showed that the HD sEMG energy maps could help visualize the dynamic muscle activities during the speaking process, and significantly different muscular contraction patterns were observed for different speaking tasks. With the same number of sensors, the classification accuracies obtained from the facial sensors were significantly lower than those obtained from the neck sensors. Moreover, the classification accuracies could exceed 90% with only 15 optimally selected sensors, which were mainly distributed on the neck rather than the face. This study suggests that the neck muscles could be the main contributor to classification performance, and that more sEMG sensors should be placed on the neck to improve ASR performance. The findings of this study could provide valuable clues for the development of a practical sEMG-based speech recognition system, especially for patients with speaking disorders.
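The channel-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for linear discriminant analysis to keep the example dependency-free, accuracy is scored on a single hold-out split, and the data are synthetic. The function names, the six-channel toy data, and the split are all assumptions made for illustration.

```python
import random


def nearest_centroid_accuracy(train, test, channels):
    """Hold-out accuracy of a nearest-centroid classifier using only `channels`.

    Assumption: this simple classifier is a stand-in for LDA.
    `train` and `test` are lists of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(channels))
        for i, c in enumerate(channels):
            acc[i] += x[c]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    correct = 0
    for x, y in test:
        # Predict the class whose centroid is closest in squared distance.
        pred = min(
            centroids,
            key=lambda lab: sum(
                (x[c] - m) ** 2 for c, m in zip(channels, centroids[lab])
            ),
        )
        correct += pred == y
    return correct / len(test)


def sequential_forward_selection(train, test, n_channels, n_select):
    """Greedily add the channel that most improves accuracy at each step."""
    selected, history = [], []
    remaining = list(range(n_channels))
    while remaining and len(selected) < n_select:
        best = max(
            remaining,
            key=lambda c: nearest_centroid_accuracy(train, test, selected + [c]),
        )
        selected.append(best)
        remaining.remove(best)
        history.append(nearest_centroid_accuracy(train, test, selected))
    return selected, history


def make_toy_data(n_per_class, rng):
    """Hypothetical data: 4 classes, 6 channels, only channels 0 and 1 informative."""
    data = []
    for label in range(4):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(6)]  # noise channels
            x[0] = 5.0 * (label // 2) + rng.gauss(0.0, 0.5)
            x[1] = 5.0 * (label % 2) + rng.gauss(0.0, 0.5)
            data.append((x, label))
    return data


rng = random.Random(0)
train = make_toy_data(40, rng)
test = make_toy_data(40, rng)
selected, history = sequential_forward_selection(
    train, test, n_channels=6, n_select=2
)
print("selected channels:", selected)
print("accuracy after each step:", history)
```

In this toy setup only channels 0 and 1 carry class information, so the greedy search should pick those two and leave the noise channels out. With real HD sEMG features the same loop would run over all 120 channels, with an LDA classifier and a more careful evaluation (for example cross-validation) in place of the stand-ins used here.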

Highlights

Dynamic high-density (HD) surface electromyography (sEMG) topographic energy maps, which show the energy distribution of the articulatory muscular activities while the subject is speaking, were constructed from the sEMG signals; a typical example is shown in Fig. 4, where high energy intensity is represented in red.
Multi-channel sEMG sensors (120 channels) were placed on the facial and neck muscles with high spatial resolution, and the recorded high-density sEMG (HD sEMG) signals were used for automatic speech recognition of English and Chinese digits.
The energy maps calculated from the HD sEMG signals showed that the muscular activities at different locations followed distinct patterns during the speaking process, and the maps could help visualize the dynamic energy distribution of the articulatory muscular activities.
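As a rough illustration of how such energy maps could be computed, the sketch below takes the root-mean-square (RMS) amplitude of each channel over consecutive non-overlapping windows and arranges each frame on a rectangular grid. The RMS measure, the window length, the row-major channel order, and the grid shape are assumptions made for illustration; the text here does not specify the actual electrode layout or energy definition.

```python
import math


def rms_energy_frames(signals, window):
    """Per-channel RMS over consecutive non-overlapping windows.

    `signals` is a list of per-channel sample lists.
    Returns frames[t][ch], one RMS value per channel per window.
    """
    n_samples = min(len(ch) for ch in signals)
    frames = []
    for start in range(0, n_samples - window + 1, window):
        frames.append([
            math.sqrt(sum(v * v for v in ch[start:start + window]) / window)
            for ch in signals
        ])
    return frames


def to_grid(frame, n_rows, n_cols):
    """Arrange one frame on a hypothetical row-major electrode grid."""
    if len(frame) != n_rows * n_cols:
        raise ValueError("grid shape does not match the number of channels")
    return [frame[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Toy input: 4 channels of alternating +a/-a samples, so each RMS equals a.
signals = [[a * (-1) ** n for n in range(8)] for a in (0.0, 1.0, 2.0, 3.0)]
frames = rms_energy_frames(signals, window=4)
energy_map = to_grid(frames[0], n_rows=2, n_cols=2)
for row in energy_map:
    print(row)
```

Plotting each grid as a color image, one per window, would give a sequence of maps over time; any interpolation between electrodes or color scaling would be a further design choice not covered by this sketch.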