Abstract

Identifying the underlying relationship between the visible motion captured by tagged MRI and intelligible speech is vital for better understanding speech production in health and disease. Because the two modalities have heterogeneous representations, however, mapping directly between them is challenging. We develop a deep learning framework that synthesizes the mel-spectrogram corresponding to a tagged MRI sequence and then converts it back into an audio waveform. Our network adopts a parallel encoder-decoder structure that takes a pair of tagged MRI sequences as input. The 3D CNN-based encoders learn to extract features of the spatiotemporally varying motion, and the decoder learns to generate the corresponding spectrograms conditioned on these latent features. For a pair of sequences of the same utterance, we further encourage their latent features to be as close as possible using the Kullback-Leibler divergence. To demonstrate the performance of our framework, we used a leave-one-out evaluation strategy on a total of 63 tagged MRI sequences from two utterances, comprising 43 “ageese” and 20 “asouk.” Our framework generated clear audio from tagged MRI sequences unseen during training, which could potentially aid in better understanding speech production and improving treatment strategies for patients with speech-related disorders.
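The abstract does not give implementation details, but the overall structure it describes can be sketched as follows. This is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the layer sizes, latent dimension, loss weighting, and the use of an L1 reconstruction loss and symmetrized KL term over softmax-normalized latents are all assumptions made for illustration; the conversion from mel-spectrogram back to a waveform (e.g., with a vocoder) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaggedMRIEncoder(nn.Module):
    """3D CNN encoder: maps a tagged MRI sequence (B, 1, T, H, W) to a latent vector.
    Channel widths and depth are illustrative assumptions."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.fc(h)

class SpectrogramDecoder(nn.Module):
    """Decoder: maps a latent vector to a mel-spectrogram (B, n_mels, n_frames)."""
    def __init__(self, latent_dim=256, n_mels=80, n_frames=64):
        super().__init__()
        self.n_mels, self.n_frames = n_mels, n_frames
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, n_mels * n_frames),
        )

    def forward(self, z):
        return self.net(z).view(-1, self.n_mels, self.n_frames)

def pairwise_kl(z_a, z_b):
    """Symmetrized KL divergence between softmax-normalized latent features,
    pulling together the latents of a pair from the same utterance (assumed form)."""
    p = F.log_softmax(z_a, dim=-1)
    q = F.log_softmax(z_b, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))

# One illustrative training step on a pair of tagged MRI sequences of the same utterance.
encoder, decoder = TaggedMRIEncoder(), SpectrogramDecoder()
mri_a = torch.randn(2, 1, 16, 64, 64)   # (batch, channel, frames, H, W) -- dummy data
mri_b = torch.randn(2, 1, 16, 64, 64)
target_mel = torch.randn(2, 80, 64)     # ground-truth mel-spectrogram (dummy)

z_a, z_b = encoder(mri_a), encoder(mri_b)
recon_loss = F.l1_loss(decoder(z_a), target_mel) + F.l1_loss(decoder(z_b), target_mel)
loss = recon_loss + 0.1 * pairwise_kl(z_a, z_b)   # 0.1 is an assumed weight
loss.backward()
```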
