Abstract

The various speech sounds of a language are obtained by varying the shape and position of the articulators surrounding the vocal tract. Analyzing their variations is crucial for understanding speech production, diagnosing speech disorders and planning therapy. Identifying key anatomical landmarks of these structures on medical images is a prerequisite for any quantitative analysis, and the rising amount of data generated in the field calls for an automatic solution. The challenge lies in the high inter- and intra-speaker variability, the mutual interaction between the articulators and the moderate quality of the images. This study addresses this issue for the first time and tackles it by means of Deep Learning. It proposes a dedicated network architecture named Flat-net, whose performance is evaluated and compared with eleven state-of-the-art methods from the literature. The dataset contains midsagittal anatomical Magnetic Resonance Images of 9 speakers sustaining 62 articulations, with 21 annotated anatomical landmarks per image. Results show that the Flat-net approach outperforms the other methods, leading to an overall Root Mean Square Error of 3.6 pixels (0.36 cm) obtained in a leave-one-out procedure over the speakers. The implementation code is also shared publicly on GitHub.
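The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes the Root Mean Square Error is taken over the Euclidean distances between predicted and annotated landmarks, and that the pixel size is 1 mm (inferred from the reported 3.6 pixels corresponding to 0.36 cm). The function names and array layout are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def landmark_rmse(pred, true):
    """RMSE over Euclidean landmark errors (assumed definition).

    pred, true: arrays of shape (n_images, n_landmarks, 2), in pixels.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    # Squared Euclidean distance for every landmark of every image
    sq_dist = np.sum((pred - true) ** 2, axis=-1)
    return float(np.sqrt(sq_dist.mean()))

def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_idx, test_idx) for each speaker."""
    speaker_ids = np.asarray(speaker_ids)
    for spk in np.unique(speaker_ids):
        test_idx = np.flatnonzero(speaker_ids == spk)
        train_idx = np.flatnonzero(speaker_ids != spk)
        yield spk, train_idx, test_idx

def pixels_to_cm(value_px, pixel_size_mm=1.0):
    """Convert a pixel distance to centimetres (1 mm/pixel is an assumption)."""
    return value_px * pixel_size_mm / 10.0
```

In such a scheme, a model would be trained on the images of all speakers but one, tested on the held-out speaker, and the errors pooled over all folds before computing the overall RMSE.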

Highlights

  • In speech, the sounds of a language are produced by varying the shape and position of the organs surrounding the vocal tract

  • Articulatory speech production studies often rely on midsagittal images of the vocal tract area, and Magnetic Resonance Imaging (MRI) is an essential modality for this approach[8,9,10]

  • This short review emphasizes the importance of anatomical landmark localization from images for a large variety of applications, and this study extends this non-exhaustive list to speech production

Summary

Introduction

The sounds of a language are produced by varying the shape and position of the organs surrounding the vocal tract. This study aims at solving this issue and is, to our knowledge, the first to address it. This objective fits within a larger framework in biomedical engineering and computer vision, where localizing anatomical landmarks on biomedical images, sometimes referred to as detecting keypoints, has already been considered in other contexts. Localizing the joints of the body on images to estimate the human pose is a long-standing problem[28]. It is a challenging issue in computer vision due to the high variability of postures, body shapes, actions, clothes and scenes. Interestingly, it transforms coordinates into images, leading to input and output data of the same nature, and appears powerful for dealing with landmarks in image processing[16,32].
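The idea of representing landmark coordinates as images can be sketched as follows. The summary does not name the encoding, so this example assumes a Gaussian heatmap, a commonly used choice; the function names and the `sigma` value are hypothetical and do not describe the authors' implementation.

```python
import numpy as np

def coords_to_heatmap(x, y, height, width, sigma=2.0):
    """Encode one landmark (x = column, y = row) as a Gaussian heatmap."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

def heatmap_to_coords(heatmap):
    """Decode a heatmap back to (x, y) by taking its maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)
```

With this encoding, a network can map an input image to one output image per landmark, so that input and output share the same nature, and the landmark position is read back from the location of the heatmap maximum.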
