Abstract

Articulatory data have gained increasing interest in speech recognition, with or without acoustic data. The electromagnetic articulograph (EMA) is one of the affordable techniques currently used for tracking the movement of flesh points on articulators (e.g., the tongue) during speech. Determining an optimal set of sensors is important for optimizing the clinical applications of EMA data, because attaching sensors to the tongue and other intraoral articulators is inconvenient, particularly for patients with neurological diseases. A recent study found an optimal set of four sensors on the tongue and lips (tongue tip, tongue body back, upper lip, and lower lip) for classifying isolated phonemes, words, or short phrases from articulatory movement data. This four-sensor set, however, has not been verified in continuous silent speech recognition. In this paper, we investigated the use of data from sensor combinations in continuous speech recognition to verify this finding on the publicly available MOCHA-TIMIT data set. The long-standing Gaussian mixture model (GMM)-hidden Markov model (HMM) approach and the more recently available deep neural network (DNN)-HMM approach were used as the recognizers. Experimental results confirmed that the four-sensor set is optimal among the full set of sensors on the tongue, lips, and jaw. Adding upper incisor and/or velum data further improved the recognition performance slightly.

Index Terms: silent speech recognition, deep neural network, hidden Markov model, electromagnetic articulograph, articulation, dysarthria
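
For concreteness, here is a minimal, hypothetical sketch of the sensor-subset evaluation described above: it slices a chosen subset of EMA channels out of a 16-dimensional frame sequence and fits a small GMM-HMM to the result. The sensor ordering, array shapes, toy data, and model sizes are assumptions for illustration, not the MOCHA-TIMIT file format or the paper's actual recognizer configuration.

```python
# Hypothetical sketch: select EMA channels for a sensor subset and fit a
# GMM-HMM on the resulting features. Sensor order and shapes are assumed.
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

SENSORS = ["UI", "LI", "V", "UL", "LL", "TT", "TB", "TD"]
FOUR_SENSOR_SET = ["TT", "TD", "UL", "LL"]  # reported optimal set
                                            # (TD standing in for tongue body back)

def subset_features(ema, subset):
    """ema: (n_frames, 16) array, x/y pairs ordered as in SENSORS.
    Returns an (n_frames, 2 * len(subset)) feature matrix."""
    cols = []
    for name in subset:
        i = SENSORS.index(name)
        cols += [2 * i, 2 * i + 1]  # x and y channels of this sensor
    return ema[:, cols]

# Toy stand-in for one utterance's EMA track (random, illustration only).
utterance = np.random.randn(500, 16)
X = subset_features(utterance, FOUR_SENSOR_SET)  # shape (500, 8)

# Fit a small GMM-HMM on the subset features; a real system would train
# one model per phone and decode continuous speech with a lexicon.
model = hmm.GMMHMM(n_components=3, n_mix=2, covariance_type="diag", n_iter=10)
model.fit(X)
print("log-likelihood:", model.score(X))
```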

Highlights

  • With the availability of affordable devices for tongue movement data collection, articulatory data have gained interest in speech science [1, 2, 3, 4] and in speech technology [5, 6]

  • We investigated the optimal set of tongue sensors for speaker-dependent continuous silent speech recognition and speech recognition

  • The acoustic data and the 16-dimensional x and y motion data obtained from the upper incisor (UI), lower incisor (LI), velum (V), upper lip (UL), lower lip (LL), tongue tip (TT), tongue blade (TB), and tongue dorsum (TD) sensors were used (see the sketch after this list)
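
The sketch below illustrates, under assumed layer sizes and state counts, the hybrid DNN-HMM idea named in the abstract: a feedforward network predicts HMM-state posteriors from a context window of these 16-dimensional frames, and the posteriors are divided by state priors to obtain emission scores. None of the hyperparameters are taken from the paper.

```python
# Hedged sketch of hybrid DNN-HMM emission scoring over 16-dim EMA frames.
# Layer sizes, context width, and state count are illustrative assumptions.
import math
import torch
import torch.nn as nn

N_SENSOR_DIMS = 16  # 8 sensors x (x, y)
CONTEXT = 5         # frames of left and right context (assumption)
N_STATES = 120      # number of HMM states (assumption)

class EmissionMLP(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = N_SENSOR_DIMS * (2 * CONTEXT + 1)
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, N_STATES),
        )

    def forward(self, x):
        return self.net(x)  # unnormalized state logits, (batch, N_STATES)

def emission_log_likelihoods(logits, log_priors):
    """Hybrid trick: log p(x|s) = log p(s|x) - log p(s) + const."""
    return torch.log_softmax(logits, dim=-1) - log_priors

# Toy forward pass on random context-windowed articulatory frames.
mlp = EmissionMLP()
frames = torch.randn(4, N_SENSOR_DIMS * (2 * CONTEXT + 1))
log_priors = torch.full((N_STATES,), -math.log(N_STATES))  # uniform prior
print(emission_log_likelihoods(mlp(frames), log_priors).shape)  # [4, 120]
```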



Introduction

With the availability of affordable devices for tongue movement data collection, articulatory data have gained interest in speech science [1, 2, 3, 4] and in speech technology (i.e., automatic speech recognition) [5, 6]. There are currently limited options to assist speech communication for individuals who have lost their voice after a laryngectomy (e.g., esophageal speech, tracheo-esophageal (also called tracheo-esophageal puncture, TEP) speech, and the electrolarynx). These approaches produce an abnormal-sounding voice [17, 18], which impacts the quality of life of laryngectomees. Silent speech interfaces (SSIs), which convert articulatory movement data into text or speech, are a potential alternative. One of the current challenges in SSI development is silent speech recognition, i.e., recognizing speech from articulatory data without audio [10, 20]; another is mapping articulatory information directly to speech [21, 22, 23]
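
As a minimal illustration of the second direction (articulatory-to-speech mapping), the sketch below fits a per-speaker linear regression from EMA frames to parallel acoustic features. Ridge regression and the toy arrays are stand-ins for exposition; references [21, 22, 23] use their own, more capable models.

```python
# Minimal stand-in for articulatory-to-acoustic mapping: linear regression
# from 16-dim EMA frames to 13-dim acoustic frames (e.g., MFCCs). The toy
# random arrays replace real time-aligned EMA/audio features.
import numpy as np
from sklearn.linear_model import Ridge

ema = np.random.randn(1000, 16)   # toy articulatory frames (8 sensors x 2)
mfcc = np.random.randn(1000, 13)  # toy parallel acoustic frames

mapper = Ridge(alpha=1.0).fit(ema, mfcc)  # multi-output regression
print(mapper.predict(ema).shape)          # (1000, 13)
```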
