Abstract

This work presents advances in the implementation of an ultrasound-based silent speech interface system. We describe the use of a portable acquisition device, a visual speech recognizer with a language model, and real-time tests with the Julius system, and present experiments with two types of visual feature extraction. Results show that good recognition and real-time performance can be obtained with a portable silent speech interface employing a language model.

Index Terms: silent speech interface, visual speech recognition, vocal tract imaging, ultrasound imaging

1. Introduction

A silent speech interface (SSI) is intended to enable speech communication in the absence of an intelligible acoustic signal [1]. Several experimental SSI systems have been developed using a variety of different sensors [1]. The REVOIX project at the Sigma Laboratory in Paris is building an SSI intended to restore the voices of speech-impaired individuals in real time. The technique chosen for REVOIX is to drive a recognizer-synthesizer system using ultrasound and video images of the tongue and lips. The REVOIX SSI thus consists of three modules operating sequentially: (1) an acquisition module to record simultaneous ultrasound and video images of the vocal tract; (2) a word-level visual speech recognizer that uses Hidden Markov Models (HTK toolkit [7]) trained on features extracted from these images rather than from acoustic features; and (3) a speech synthesizer. To be genuinely useful, such a device will ultimately have to be lightweight, have good recognition and synthesis performance, and operate in real time. In this report, we build upon the groundwork laid in earlier research [2-6] by:

- introducing a new, portable acquisition system;
- comparing different types of visual feature extraction;
- introducing the use of a language model to improve recognition accuracy;
- experimenting with a real-time implementation of the recognition using the Julius system.

Our results show that it is possible to obtain good recognition and real-time performance using a portable SSI system employing a language model. The visual speech acquisition system and the acquired corpora are described in Sections 2 and 3. In Section 4, two visual speech feature extraction techniques, namely the EigenTongues/EigenLips decomposition and the Discrete Cosine Transform (DCT), are presented. The experimental results are given in Section 5. Conclusions are drawn in Section 6.
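For concreteness, the following is a minimal, hypothetical sketch of DCT-based visual feature extraction of the kind discussed in Section 4: a 2-D DCT is applied to a grayscale ultrasound (or lip) frame and only a low-frequency block of coefficients is retained as the feature vector. The frame size, the number of retained coefficients, and the function name are illustrative assumptions, not the configuration used in this work.

    # Illustrative sketch only; frame size and coefficient count are assumed, not the paper's settings.
    import numpy as np
    from scipy.fftpack import dct

    def dct_features(frame: np.ndarray, keep: int = 5) -> np.ndarray:
        """Return the `keep` x `keep` block of low-frequency 2-D DCT coefficients as a vector."""
        frame = frame.astype(np.float64)
        # Separable 2-D DCT-II: apply the 1-D DCT along rows, then along columns.
        coeffs = dct(dct(frame, axis=0, norm="ortho"), axis=1, norm="ortho")
        # Keep only the low-frequency (top-left) block as the visual feature vector.
        return coeffs[:keep, :keep].ravel()

    # Example: a 64x64 grayscale frame reduced to a 25-dimensional feature vector.
    features = dct_features(np.random.rand(64, 64), keep=5)
    print(features.shape)  # (25,)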
