Abstract

In speech production research, integrating articulatory data derived from multiple measurement modalities can provide a rich description of vocal tract dynamics by overcoming the limited spatio-temporal representations offered by any individual modality. This paper presents a spatial and temporal alignment method between two promising modalities, using a corpus of TIMIT sentences obtained from the same speaker: flesh-point tracking from Electromagnetic Articulography (EMA), which offers high temporal resolution but sparse spatial information, and real-time Magnetic Resonance Imaging (MRI), which offers good spatial detail but at lower temporal rates. Spatial alignment is performed using palate tracking from EMA, but distortion in the MRI audio and variability in the articulatory data make temporal alignment challenging. This paper proposes a novel alignment technique based on joint acoustic-articulatory features, which combines dynamic time warping with automatic feature extraction from MRI images. Experimental results show that the temporal alignment obtained using this technique is better (by 12% relative) than that obtained using acoustic features alone.
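
To make the idea concrete, the following is a minimal sketch of joint acoustic-articulatory alignment: z-score-normalize the acoustic (e.g., MFCC) and articulatory feature streams, concatenate them with a weighting factor, and warp the two recordings' joint feature sequences with dynamic time warping (DTW). The weighting knob `w`, the normalization, and all function names are illustrative assumptions, not the paper's exact formulation (the paper, for instance, extracts its articulatory features automatically from the MRI images).

```python
import numpy as np

def zscore(F):
    """Per-dimension z-score normalization of a (frames, dims) stream."""
    return (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-8)

def joint_features(mfcc, artic, w=0.5):
    """Weighted concatenation of acoustic and articulatory features that
    are already at a common frame rate; w is an assumed weighting knob."""
    return np.hstack([(1.0 - w) * zscore(mfcc), w * zscore(artic)])

def dtw_path(X, Y):
    """Classic O(n*m) DTW between sequences X (n, d) and Y (m, d);
    returns the optimal warping path as (i, j) frame-index pairs."""
    n, m = len(X), len(Y)
    # Pairwise Euclidean distances between all frame pairs.
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j], D[i, j - 1])
    # Backtrack from (n, m) to recover the optimal path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

A plain quadratic-time DTW is used here for transparency; any standard DTW implementation with a suitable local-path constraint would serve equally well.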

Highlights

  • Speech production research crucially relies on articulatory data acquired through a variety of measurement methods

  • For different choices of A, the Average Phonetic boundary Distance (APD) averaged over all sentences reduces by ∼6 msec when articulatory features are used in addition to mel-frequency cepstral coefficients (MFCCs) in the proposed Joint Acoustic-Articulatory based Temporal Alignment (JAATA); a sketch of this metric follows the list

  • To examine how correlated the mean pixel trajectory is with the corresponding Electromagnetic articulography (EMA) trajectory, we report the correlation coefficient (ρ) between the two. ρ, averaged over all articulators, is 0.59 with a standard deviation (SD) of 0.10; across articulators, ρ ranges from 0.36 (ULy) to 0.68 (LIx). These values suggest that, on average, the features from the mean intensity over optimum magnetic resonance imaging (MRI) regions are linearly correlated with the respective EMA trajectories; a sketch of this computation also follows the list
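
As one concrete reading of the APD metric in the highlight above, the sketch below takes APD to be the mean absolute time difference between corresponding phone boundaries from a reference segmentation and from the candidate alignment. This is an assumed form for illustration; the paper's exact definition may differ.

```python
import numpy as np

def average_phonetic_boundary_distance(ref_boundaries, aligned_boundaries):
    """Mean absolute distance (in msec) between corresponding phone
    boundaries; both inputs are equal-length arrays of times in seconds.
    Assumed reading of APD, not necessarily the paper's exact formula."""
    ref = np.asarray(ref_boundaries, dtype=float)
    hyp = np.asarray(aligned_boundaries, dtype=float)
    return 1000.0 * np.mean(np.abs(ref - hyp))

# Example: three boundaries off by 10, 5 and 15 msec -> APD = 10 msec.
print(average_phonetic_boundary_distance([0.10, 0.25, 0.40],
                                         [0.11, 0.245, 0.415]))
```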
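The reported ρ can likewise be outlined in a few lines: average the pixel intensity over a chosen MRI region in every frame, bring the EMA trajectory (100 Hz) onto the rtMRI frame times (about 23.18 frames/second), and compute the Pearson correlation. The region mask and the linear-interpolation resampling below are assumptions for illustration; the paper selects optimum regions automatically.

```python
import numpy as np

def mean_pixel_trajectory(frames, region_mask):
    """frames: (T, H, W) rtMRI image stack; region_mask: boolean (H, W).
    Returns the mean pixel intensity inside the region for each frame."""
    return frames[:, region_mask].mean(axis=1)

def correlate_with_ema(pixel_traj, mri_times, ema_traj, ema_times):
    """Pearson correlation between a mean-intensity trajectory and an EMA
    sensor trajectory, after linearly interpolating the EMA samples onto
    the rtMRI frame times so the two share a common time base."""
    ema_on_mri = np.interp(mri_times, ema_times, ema_traj)
    return np.corrcoef(pixel_traj, ema_on_mri)[0, 1]
```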


Introduction

Speech production research crucially relies on articulatory data acquired through a variety of methods. Each method has its advantages in terms of the nature of the information it offers, while at the same time being limited in important ways, notably in the spatio-temporal detail it can provide. Prominent methods include X-ray microbeam, Electropalatography, Electromagnetic articulography (EMA) and, more recently, (real-time) magnetic resonance imaging (MRI). EMA offers motion capture of several fleshpoint sensors in two (sagittal) or three (parasagittal) dimensions with high temporal resolution (100 samples/second in the WAVE system), while real-time MRI (rtMRI) provides a complete midsagittal (or any arbitrary 2D scan plane) view of the vocal tract at relatively low temporal resolution (68 × 68 pixel images at 23.18 frames/second [1]). Combining the information from these multimodal sources can be beneficial, but simultaneous acquisition with these techniques is usually not possible because of incompatible technology requirements and limitations. Algorithmically co-registering and integrating these datasets is therefore the most plausible avenue.
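
The spatio-temporal mismatch between the two modalities can be made concrete with a short sketch of the array shapes implied by the text; the sensor count and utterance duration below are hypothetical.

```python
import numpy as np

# Shapes implied by the text, for a hypothetical 3-second utterance
# with a hypothetical six-sensor EMA setup.
DUR_S = 3.0
ema = np.zeros((int(DUR_S * 100.0), 6, 2))    # 100 Hz, 6 sensors, (x, y)
mri = np.zeros((int(DUR_S * 23.18), 68, 68))  # 23.18 fps, 68 x 68 pixels

# Roughly 4.3 EMA samples fall within each MRI frame interval, which is
# why explicit temporal alignment, rather than naive index matching,
# is needed before the two streams can be integrated.
print(ema.shape, mri.shape, ema.shape[0] / mri.shape[0])
```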
