Lip-Reading Using Pixel-Based and Geometry-Based Features for Multimodal Human–Robot Interfaces

Denis Ivanko, Alexey Karpov, Dmitry Ryumin, Alexandr Axyonov, Irina Kipyatkova

doi:10.1007/978-981-13-9267-2_39

Abstract

Automatic lip-reading (ALR) is a challenging task, and a significant amount of research has been devoted to it in recent years. However, continuous Russian speech recognition remains an under-investigated area. In this paper, we present the results of a Russian visual speech recognition (VSR) system using pixel-based and advanced geometry-based features. The HAVRUS video database, comprising 4000 utterances of continuous Russian speech collected from 20 speakers, is used in this study. Pixel-based features (based on principal component analysis, PCA) and geometry-based features (based on the active appearance model, AAM) were used for feature representation, and Gaussian mixture hidden Markov models (HMMs) were used for classification. Our evaluation experiments show a significant improvement (up to 9%) in recognition accuracy with the proposed geometry-based features compared to pixel-based PCA features. In future studies, we plan to use the combined VSR system to augment the performance of audio-based automatic speech recognition systems in human–robot interfaces.
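To make the pixel-based branch more concrete, the following is a minimal sketch of PCA feature extraction from flattened mouth-region frames. It is an illustration only, not the authors' implementation: the function names `fit_pca` and `pca_features`, the 16x16 frame size, the choice of 10 components, and the random stand-in data are all assumptions, since the abstract does not specify these details.

```python
import numpy as np


def fit_pca(frames, n_components):
    """Fit PCA on flattened frames (one row per frame). Hypothetical helper."""
    X = np.asarray(frames, dtype=np.float64)
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_features(frames, mean, components):
    """Project flattened frames onto the retained principal axes."""
    X = np.asarray(frames, dtype=np.float64)
    return (X - mean) @ components.T


# Synthetic stand-in for grayscale mouth-region crops:
# 200 frames of 16x16 pixels (assumed sizes, not taken from the paper).
rng = np.random.default_rng(0)
frames = rng.random((200, 16 * 16))

mean, components = fit_pca(frames, n_components=10)
features = pca_features(frames, mean, components)  # one 10-dim vector per frame
```

In a full system, each per-frame feature vector would form one observation in the sequence passed to the HMM classifier; that stage is omitted here.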

Full Text