Abstract

A speech synthesis system was developed based on Maeda's method [S. Maeda, Speech Commun. 1, 199–229 (1982)], which simulates acoustic wave propagation in the vocal tract in the time domain. The system has a graphical user interface (GUI) that allows fine control of synthesis parameters and timing. In addition, the piriform fossae were included in the vocal-tract model, producing antiresonances in the speech spectra in the frequency region from 4 to 5 kHz. The system can produce all the Japanese phonemes using vocal-tract area functions (VTAFs) extracted from 3-D cine-MRI obtained during production of VCV or CVCV sequences by a male speaker. Japanese sentences can be synthesized with high naturalness and intelligibility by concatenating segmental units and controlling the glottal source through the GUI. Since a time-varying VTAF is obtained by interpolating between VTAFs, the dataset of the system is significantly smaller than that of corpus-based speech synthesizers. The speaker-specific VTAFs and the inclusion of the piriform fossae permit reproduction of speaker-specific spectral shapes, not only in the lower formants but also in the higher frequency regions that contribute to the perception of speaker individuality. [Work supported by NICT, SCOPE-R, and a Grant-in-Aid for Scientific Research of Japan.]
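The abstract's point about dataset size rests on generating a time-varying VTAF by interpolation between stored area functions. A minimal sketch of that idea, assuming simple frame-wise linear interpolation between two hypothetical MRI-derived area functions (the section count, area values, and function name here are illustrative, not from the paper):

```python
import numpy as np

def interpolate_vtaf(vtaf_start, vtaf_end, n_frames):
    """Linearly interpolate between two vocal-tract area functions.

    vtaf_start, vtaf_end: 1-D arrays of cross-sectional areas (cm^2),
    one value per tract section, as might be extracted from cine-MRI.
    Returns an (n_frames, n_sections) array: one area function per frame.
    """
    vtaf_start = np.asarray(vtaf_start, dtype=float)
    vtaf_end = np.asarray(vtaf_end, dtype=float)
    # Interpolation weights running from 0 (start frame) to 1 (end frame).
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - t) * vtaf_start + t * vtaf_end

# Two hypothetical 8-section area functions (e.g., for adjacent phonemes).
a0 = np.array([2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0])
a1 = np.array([1.0, 1.5, 2.0, 4.0, 5.0, 4.0, 3.0, 2.0])
frames = interpolate_vtaf(a0, a1, n_frames=5)  # shape (5, 8)
```

Only the endpoint VTAFs need to be stored per segmental unit; intermediate tract shapes are computed on the fly, which is why the dataset stays far smaller than a waveform corpus.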
