Abstract
In speech production research, the integration of articulatory data derived from multiple measurement modalities can provide a rich description of vocal tract dynamics by overcoming the limited spatio-temporal representations offered by individual modalities. This paper presents a spatial and temporal alignment method between two promising modalities, using a corpus of TIMIT sentences obtained from the same speaker: flesh-point tracking from Electromagnetic Articulography (EMA), which offers high temporal resolution but sparse spatial information, and real-time Magnetic Resonance Imaging (MRI), which offers good spatial detail but at lower temporal rates. Spatial alignment is performed using palate tracking from EMA, but distortion in the MRI audio and articulatory data variability make temporal alignment challenging. This paper proposes a novel alignment technique using joint acoustic-articulatory features, which combines dynamic time warping and automatic feature extraction from MRI images. Experimental results show that the temporal alignment obtained using this technique is better (12% relative) than that using acoustic features only.
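The temporal alignment described above rests on dynamic time warping (DTW) over joint acoustic-articulatory feature vectors. The following Python sketch shows plain DTW of that kind, assuming both recordings have already been reduced to frame-synchronous feature matrices of equal dimensionality; the `joint_features` helper and its articulatory weight `a` are illustrative stand-ins (loosely analogous to the paper's parameter A), not the paper's exact formulation.

```python
import numpy as np

def dtw_path(X, Y):
    """Classic dynamic time warping between two feature sequences.

    X: (Tx, D) feature matrix for one recording session.
    Y: (Ty, D) feature matrix for the other session, same dimensionality.
    Returns the accumulated alignment cost and the optimal warping path.
    """
    Tx, Ty = len(X), len(Y)
    # Pairwise Euclidean distances between frames.
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

    # Accumulated-cost matrix with the usual three-way recursion.
    acc = np.full((Tx + 1, Ty + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )

    # Backtrack from (Tx, Ty) to recover the warping path.
    path, i, j = [], Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[Tx, Ty], path[::-1]

def joint_features(mfcc, artic, a=0.5):
    """Stack acoustic (MFCC) columns with articulatory columns.

    `a` weights the articulatory stream relative to the acoustic one;
    both the helper and the weight are hypothetical illustrations.
    """
    return np.hstack([mfcc, a * artic])
```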
Highlights
Speech production research crucially relies on articulatory data acquired by a variety of methods
For different choices of the parameter A, the Average Phonetic-boundary Distance (APD), averaged over all sentences, reduces by ∼6 msec when articulatory features are used in addition to mel-frequency cepstral coefficients (MFCC) with the proposed Joint Acoustic-Articulatory based Temporal Alignment (JAATA)
To examine how correlated the mean pixel trajectory is with the corresponding Electromagnetic Articulography (EMA) trajectory, we report the correlation coefficient (ρ) between the two; a sketch of this computation follows these highlights. ρ, averaged over all articulators, is 0.59 with a standard deviation (SD) of 0.10. ρ values for different articulators range from 0.36 (ULy) to 0.68 (LIx). These values suggest that, on average, the features from the mean intensity over optimum magnetic resonance imaging (MRI) regions are linearly correlated with the respective EMA trajectories
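For reference, a correlation of this kind can be computed along the following lines: average the pixel intensity inside a fixed MRI region across frames, resample the 100 Hz EMA trajectory onto the (roughly 23 fps) MRI frame times, and take the Pearson correlation. This is a minimal sketch under those assumptions; the paper's search for the optimum region is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def mean_intensity_trajectory(frames, region_mask):
    """Mean pixel intensity inside a fixed region, per rtMRI frame.

    frames: (T, H, W) array of grayscale images (e.g. 68 x 68 pixels).
    region_mask: (H, W) boolean mask selecting the region of interest.
    """
    return frames[:, region_mask].mean(axis=1)

def correlation_with_ema(pixel_traj, fps_mri, ema_traj, fps_ema=100.0):
    """Pearson correlation (rho) between an MRI-derived trajectory and
    an EMA trajectory, after resampling EMA to the MRI frame times."""
    t_mri = np.arange(len(pixel_traj)) / fps_mri
    t_ema = np.arange(len(ema_traj)) / fps_ema
    ema_at_mri = np.interp(t_mri, t_ema, ema_traj)
    return np.corrcoef(pixel_traj, ema_at_mri)[0, 1]
```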
Summary
Speech production research crucially relies on articulatory data acquired by a variety of methods. Each method has its advantages in terms of the nature of the information it offers, while at the same time being limited in important ways, notably in terms of the spatio-temporal detail offered. Examples include X-ray microbeam, Electropalatography, Electromagnetic Articulography (EMA) and, recently, (real-time) Magnetic Resonance Imaging (MRI). EMA offers motion capture of several flesh-point sensors in two (sagittal) or three (parasagittal) dimensional coordinates with high temporal resolution (100 samples/second in the WAVE system), while real-time MRI (rtMRI) provides a complete midsagittal (or along any arbitrary 2D scan plane) view of the vocal tract at relatively low temporal resolution (68 × 68 pixel images at 23.18 frames/second [1]). Combining the information from these multimodal sources can be beneficial, but simultaneous acquisition with these techniques is usually not possible because of their respective technology requirements and limitations. Algorithmically co-registering and integrating these datasets is the most plausible avenue.
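On the spatial side, the palate-based registration mentioned in the abstract can be realized with a standard least-squares 2D similarity (Procrustes) transform, sketched below under the assumption that corresponding palate points are available from both modalities; the paper's exact procedure may differ.

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping 2D points `src` onto `dst` (both (N, 2), in correspondence),
    e.g. an EMA palate trace onto a palate contour from the MRI images.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d          # center both point sets
    U, sig, Vt = np.linalg.svd(D.T @ S)
    sgn = np.ones(len(sig))
    sgn[-1] = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = (U * sgn) @ Vt                        # optimal rotation
    scale = (sig * sgn).sum() / (S ** 2).sum()
    t = mu_d - scale * (R @ mu_s)
    return scale, R, t

# Apply to new EMA points: x_mri = scale * (R @ x_ema) + t
```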