Voice-controlled systems are an attractive option for increasing technological integration, not only for people with little knowledge of technology or constrained Internet access, but also for people with certain disabilities. In addition, devices based on Alexa or Google Home provide an interesting alternative for interacting with Internet of Things (IoT) devices, but they usually rely on an Internet connection to a cloud server for their full operation. Furthermore, many voice-recognition systems are only available in a limited number of languages, which tend to be those with the highest number of speakers, thus excluding minority-language speakers. To address these issues, this article presents a solution based on Edge Computing and voice commands that carries out offline voice processing and is able to interact with IoT-based systems. The proposed system performs local speech inference and provides a communication interface with IoT devices in a Bluetooth mesh, all quickly and without the need for an Internet connection. In addition, the proposed solution can be adapted easily to voice recognition for languages with few resources. This feature is demonstrated with the Galician language, which is spoken by fewer than 3 million people worldwide. In particular, different Automatic Speech Recognition (ASR) models based on three of the most popular ASR development frameworks (wav2vec2, DistilHubert, Whisper) were developed to transcribe short speech and to translate it into IoT commands that perform specific home-automation actions. These models were fine-tuned for Galician with a corpus of approximately 20 hours and were evaluated in static and mobile opportunistic scenarios in terms of accuracy, energy consumption and latency on an embedded platform (which acts as an edge device) and on a cloud server.
The results show that inference takes less than 2 seconds on a Raspberry Pi 4 for the two smallest models and less than 500 ms on a high-end Android smartphone when all data are processed locally with CPU-only inference (i.e., without hardware acceleration or external processing). The transcriptions are accurate enough that simple text-distance algorithms can detect keywords in the speech and trigger commands on IoT devices. In particular, a maximum success rate of 92% was achieved for detecting the indicated commands when using models optimized for execution on embedded devices. For selected home scenarios, command actions were sent via Bluetooth with average response times of up to 113 ms.
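As an illustration of how a text-distance step might map a noisy transcription to a command, the following is a minimal sketch only. It assumes Levenshtein (edit) distance as the metric, which the abstract does not specify. The Galician phrases, the command identifiers, the `match_command` helper and the 0.3 rejection threshold are all hypothetical and are not taken from the article.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical command table (illustrative phrases and identifiers only).
COMMANDS = {
    "acende a luz": "LIGHT_ON",
    "apaga a luz": "LIGHT_OFF",
    "abre a persiana": "BLIND_OPEN",
}


def match_command(transcription: str, max_ratio: float = 0.3):
    """Return the closest command identifier, or None if nothing is close enough.

    max_ratio is an assumed threshold on distance / longest string length.
    """
    text = " ".join(transcription.lower().split())  # normalize case and spacing
    best_phrase = min(COMMANDS, key=lambda phrase: levenshtein(text, phrase))
    distance = levenshtein(text, best_phrase)
    if distance / max(len(text), len(best_phrase), 1) > max_ratio:
        return None  # too far from every known phrase: ignore the utterance
    return COMMANDS[best_phrase]


if __name__ == "__main__":
    print(match_command("acende a lus"))              # one-character ASR error
    print(match_command("que tempo vai facer mañá"))  # unrelated utterance
```

In this sketch, a transcription with a single-character error still resolves to the intended command, while an unrelated sentence falls above the threshold and is rejected. In a deployment like the one described, the returned identifier would then be translated into a message for the Bluetooth mesh; that step is outside the scope of this example.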