Introduction: Modern automatic speech recognition systems are highly effective in quiet acoustic conditions, reaching 90–95% accuracy. However, in noisy, uncontrolled environments, acoustic signals are often distorted, which greatly reduces recognition accuracy. In such adverse conditions it is appropriate to use visual information about the speech, as it is not affected by acoustic noise. At present, there are no studies that objectively characterize the dependence of visual speech recognition accuracy on the video frame rate, and no relevant audio-visual databases exist for model training.

Purpose: To improve the reliability and accuracy of an automatic audio-visual Russian speech recognition system; to collect a representative audio-visual database and to develop an experimental setup.

Methods: For audio-visual speech recognition, we used coupled hidden Markov model architectures. For the parametric representation of audio and visual features, we used mel-frequency cepstral coefficients and pixel features based on principal component analysis.

Results: In the experiments, we studied five video frame rates: 25, 50, 100, 150, and 200 fps. The experiments showed a positive effect from using a high-speed video camera: compared with the standard recording speed of 25 fps, we achieved an absolute accuracy increase of 1.48% for the bimodal system and 3.10% for the unimodal one. During the experiments, the test data for all speakers were mixed with two types of noise: wide-band white noise and babble noise. The analysis shows that bimodal speech recognition exceeds unimodal recognition in accuracy, especially at low SNR values (<15 dB). At very low SNR values (<5 dB), the acoustic information becomes uninformative, and the best results are achieved by a unimodal visual speech recognition system.

Practical relevance: The use of a high-speed camera can improve the accuracy and robustness of a continuous audio-visual Russian speech recognition system.
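The principal component analysis step for visual features mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mouth-ROI frame size (32×32), the number of frames, and the number of retained components (20) are all assumed values, and the input here is random data standing in for real video frames.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 flattened grayscale mouth-region frames of assumed size 32x32 pixels
frames = rng.random((200, 32 * 32))

# Center the data, then obtain principal directions via SVD
mean_frame = frames.mean(axis=0)
centered = frames - mean_frame
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 20                          # assumed number of retained components
features = centered @ Vt[:k].T  # per-frame visual feature vectors

print(features.shape)           # (200, 20)
```

Each video frame is thus reduced from 1024 raw pixel values to a compact 20-dimensional feature vector, which is the kind of low-dimensional visual observation a coupled hidden Markov model can consume alongside the acoustic mel-frequency cepstral coefficients.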