Abstract

Ultrasound imaging of the tongue has been used for decades in studies of speech production and speech motor control, in silent speech interfaces, and in numerous other areas. Despite substantial efforts, however, the extraction of reliable features from ultrasound tongue data remains a challenge due to speckle noise and acoustic propagation issues. Recently, Representation Learning has emerged in a variety of fields as a powerful means of generating useful representations of the underlying structure in raw, high-dimensional data. In its unsupervised form, Representation Learning discovers structure in unlabelled data, thereby eliminating the need for a time-consuming labelling step. The present work is believed to be the first use of unsupervised Representation Learning to reveal structure related to tongue dynamics in unlabelled ultrasound video. A 3-D Convolutional Neural Network (3D-CNN) examining a series of unlabelled 60 Hz tongue images is found to accurately predict unseen future images, even for large inter-frame tongue displacements. By comparing the 3D-CNN prediction error with that of a simple previous-frame predictor, tongue trajectories containing transitions between regions of acoustic stability can be identified and correlated with formant trajectories in a spectrogram. Prospects for leveraging the tongue dynamics representation in subsequent speech processing tasks will be discussed.
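The abstract does not specify the network architecture or training details. The following is a minimal sketch of the core idea in PyTorch, assuming a four-frame input window, a small arbitrary 3-D convolutional stack, and mean-squared error as the prediction-error measure; the Frame3DCNN class, the prediction_errors helper, and all layer sizes are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): a small 3-D CNN that predicts the
# next ultrasound frame from a short window of preceding frames, plus the
# previous-frame baseline used for comparison. Window length, channel sizes,
# and MSE as the error measure are illustrative assumptions.
import torch
import torch.nn as nn

class Frame3DCNN(nn.Module):
    """Predict frame t+1 from frames t-K+1..t (input shape: B x 1 x K x H x W)."""
    def __init__(self, k_frames: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
            nn.ReLU(),
            # Collapse the temporal axis so the output is a single frame.
            nn.Conv3d(32, 1, kernel_size=(k_frames, 3, 3), padding=(0, 1, 1)),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 1, K, H, W) -> predicted next frame: (B, 1, H, W)
        return self.encoder(clip).squeeze(2)

def prediction_errors(model, clip, next_frame):
    """Per-sample MSE of the 3D-CNN prediction and the previous-frame baseline."""
    with torch.no_grad():
        pred = model(clip)
    mse = lambda a, b: ((a - b) ** 2).flatten(1).mean(dim=1)
    cnn_err = mse(pred, next_frame)             # learned predictor
    base_err = mse(clip[:, :, -1], next_frame)  # copy the last observed frame
    return cnn_err, base_err

# Frames where base_err greatly exceeds cnn_err correspond to large
# inter-frame tongue displacements, i.e. candidate transitions between
# regions of stability.
model = Frame3DCNN(k_frames=4)
clip = torch.rand(2, 1, 4, 64, 64)   # batch of 2 four-frame windows
target = torch.rand(2, 1, 64, 64)    # the true next frames
cnn_err, base_err = prediction_errors(model, clip, target)
print(base_err / cnn_err)            # high ratio -> rapid tongue motion

In the framing suggested by the abstract, this error ratio would be computed frame by frame along an utterance, and its peaks aligned with formant transitions in the spectrogram; smoothing and thresholding choices are left open here.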
