NEURAL NETWORK ARCHITECTURE FOR TEXT DECODING BASED ON SPEAKER'S LIP MOVEMENTS

Olesia Barkovska, Vladyslav Kholiev

doi:10.31891/csit-2023-4-7

Abstract

In this paper, we tested a command recognition system using the SSI approach and conducted a series of experiments on modern solutions based on ALR interfaces. The main goal was to improve the accuracy of speech recognition in cases where a noise-free audio sequence of the speaker is not available, for example, at a great distance from the speaker or in a noisy environment. The results showed that training the neural network on a GPU accelerator reduced the training time by a factor of 26.2 when using a high-resolution training sample with a selected mouth area of 150 × 100 pixels. Analysis of the selected speech recognition quality metrics (word recognition rate (WRR), word error rate (WER), and character error rate (CER)) showed that the maximum word recognition rate for the speaker's speech is 96.71%, achieved after 18 epochs of training. When viseme recognition is evaluated by character recognition rate, the highest rate is obtained after 13 epochs of training. Future research will focus on the use of depth cameras and stereo vision methods with increased frame rates to further improve the accuracy of voice command decoding under high background noise.
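For readers unfamiliar with the metrics named above, the following is a minimal sketch of how WER, CER, and WRR are commonly computed from a reference transcript and a decoded hypothesis. It assumes the standard Levenshtein (edit distance) definitions and takes WRR as 1 − WER; the abstract does not state the exact formulas used in the experiments, so the function names and the WRR definition here are illustrative assumptions, not the authors' implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`. Works on strings
    (character level) and on lists of words (word level).
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wrr(reference, hypothesis):
    """Word recognition rate, assumed here to be 1 - WER."""
    return 1.0 - wer(reference, hypothesis)
```

For example, `wer("bin blue at f two now", "bin blue at f nine now")` counts one substituted word out of six reference words; the command phrases are hypothetical and not taken from the paper's dataset. Note that under this definition WER can exceed 1 when the hypothesis contains many insertions, in which case WRR becomes negative.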
