Abstract
Voice control is an important function in many mobile devices and smart homes, and it gives people with disabilities in particular a convenient way to interact with devices. Despite many studies on this problem worldwide, there has been no formal study for the Vietnamese language. In addition, many studies do not offer a solution that can be extended easily in the future. In this study, a dataset of Vietnamese speech commands is labeled and organized to be shared with the language research community in general and the Vietnamese language research community in particular. This paper also provides software for speech collection and processing. Recurrent Neural Networks are designed and evaluated on the collected data. The average recognition accuracy on a set of 15 commands for controlling smart home devices is 98.19%.
Highlights
Interaction with and control of household devices is a fast-growing trend, evident in the exponentially increasing number of smart homes
The results show that a throat microphone is robust in noisy environments, achieving a 95.4% hit rate in a speech recognition system with multiple Neural Networks (NNs) using the one-against-all approach, while a simple NN could only reach 91.88%
In “End-to-End Speech Command Recognition with Capsule Network” [8], Jaesung Bae and Dae-Shik Kim observe that Convolutional Neural Networks (CNNs) capture local features effectively
Summary
Interaction with and control of household devices is a fast-growing trend, evident in the exponentially increasing number of smart homes. In “Binary Neural Networks for Classification of Voice Commands from Throat Microphone” [2], the authors use binary classifiers and Neural Networks (NNs), together with a perceptual linear prediction method for feature extraction, to increase the classification rate of voice commands captured with a throat microphone, and they compare this method with a single NN. They create a dataset from 150 people (men and women). In “End-to-End Speech Command Recognition with Capsule Network” [8], Jaesung Bae and Dae-Shik Kim observe that Convolutional Neural Networks (CNNs) capture local features effectively. CNNs can be used for tasks that have relatively short-term dependencies, such as keyword spotting or phoneme-level sequence recognition.