Abstract
Midsagittal ultrasound imaging of the tongue is a portable and inexpensive way to provide articulatory information. However, although ultrasound images show a portion of the tongue surface, other vocal tract structures (e.g., the palate) are not typically visible. This missing information may be useful for speech therapy and other applications, e.g., by characterizing vocal tract constrictions and informing how morphological variations affect speech patterns. Prediction of the vocal tract shape from information available during ultrasound imaging (e.g., tongue contours and audio recordings) is thus potentially valuable. Recent advancements in articulatory prediction from audio recordings (i.e., acoustic inversion) and in speech recognition using combined articulatory and acoustic data have used neural network models. Inspired by these models, this study investigates how well fusion of articulatory and acoustic features in speaker-independent models can predict expanded articulatory information. Specifically, recurrent neural network models will be trained to predict the vocal tract shape from partial tongue contours and acoustic features during production of vowels and central approximants. Features will be extracted from simultaneously recorded audio and 2D MRI (USC 75-Speaker Database). Different acoustic features and network architectures will be compared, with the goal of refining future models to predict vocal tract shapes during ultrasound imaging.
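The fusion approach described above can be sketched in code. The following is a minimal, hypothetical illustration only: the feature dimensions, the plain Elman-style recurrence, and the linear readout are all assumptions for demonstration (the study itself leaves the specific acoustic features and architectures as points of comparison). Per frame, partial tongue-contour features and acoustic features are concatenated and passed through a recurrent layer whose output regresses the full vocal tract contour.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
N_TONGUE = 20      # coordinates along the visible (partial) tongue contour
N_ACOUSTIC = 13    # acoustic features per frame (e.g., MFCC-like)
N_HIDDEN = 32      # recurrent state size
N_OUT = 60         # coordinates describing the full vocal tract shape

# Randomly initialized weights stand in for a trained model.
W_in = rng.standard_normal((N_HIDDEN, N_TONGUE + N_ACOUSTIC)) * 0.1
W_rec = rng.standard_normal((N_HIDDEN, N_HIDDEN)) * 0.1
W_out = rng.standard_normal((N_OUT, N_HIDDEN)) * 0.1

def predict_sequence(tongue, acoustic):
    """Run the recurrent model over a sequence of frames.

    tongue:   (T, N_TONGUE) partial tongue-contour features
    acoustic: (T, N_ACOUSTIC) acoustic features
    returns:  (T, N_OUT) predicted vocal tract shape per frame
    """
    h = np.zeros(N_HIDDEN)
    outputs = []
    for t in range(len(tongue)):
        x = np.concatenate([tongue[t], acoustic[t]])  # feature fusion
        h = np.tanh(W_in @ x + W_rec @ h)             # recurrent update
        outputs.append(W_out @ h)                     # linear readout
    return np.stack(outputs)

# Example: 50 frames of synthetic features in place of real recordings.
T = 50
pred = predict_sequence(rng.standard_normal((T, N_TONGUE)),
                        rng.standard_normal((T, N_ACOUSTIC)))
print(pred.shape)  # (50, 60): one predicted contour per frame
```

In practice a gated recurrence (LSTM/GRU) trained on paired MRI and audio data would replace the random weights here; the sketch only shows where articulatory and acoustic features would be fused.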