Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep Convolutional LSTM Network.

Mona Kirstin Fehling, Bernhard Schick, Fabian Grosch, Maria Elke Schuster, Jörg Lohscheller

doi:10.1371/journal.pone.0227791

Open Access

https://doi.org/10.1371/journal.pone.0227791

Abstract

The objective investigation of the dynamic properties of vocal fold vibrations requires the recording and quantitative analysis of laryngeal high-speed video (HSV). Quantifying vocal fold vibration patterns requires, as a first step, the segmentation of the glottal area within each video frame, from which the vibrating edges of the vocal folds are usually derived. Consequently, the outcome of any further vibration analysis depends on the quality of this initial segmentation. In this work we propose, for the first time, a procedure to fully automatically segment not only the time-varying glottal area but also the vocal fold tissue directly from laryngeal HSV using a deep Convolutional Neural Network (CNN) approach. Eighteen different CNN configurations were trained and evaluated on a total of 13,000 HSV frames obtained from 56 healthy and 74 pathologic subjects. The segmentation quality of the best-performing CNN model, which uses Long Short-Term Memory (LSTM) cells to also take the temporal context into account, was investigated in depth on 15 test video sequences comprising 100 consecutive images each. The Dice Coefficient (DC) and the precision of four anatomical landmark positions were used as performance measures. Over all test data, a mean DC of 0.85 was obtained for the glottis, and 0.91 and 0.90 for the right and left vocal fold (VF), respectively. The grand average precision of the identified landmarks amounts to 2.2 pixels and is in the same range as comparable manual expert segmentations, which can be regarded as the gold standard. The proposed method requires no user interaction and overcomes the limitations of current semiautomatic or computationally expensive approaches. Thus, it also allows for the analysis of long HSV sequences and holds the promise of facilitating the objective analysis of vocal fold vibrations in clinical routine. The dataset used here, including the ground truth, will be made freely available to all scientific groups to allow quantitative benchmarking of segmentation approaches in the future.
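
The landmark precision reported in pixels can be illustrated with a small sketch. It assumes that precision is measured as the Euclidean distance between each predicted landmark and its reference position, averaged over the landmarks; the abstract does not spell out the exact definition, so this is an assumption. The function name and the coordinates are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in pixels


def mean_landmark_error(predicted: Sequence[Point], reference: Sequence[Point]) -> float:
    """Mean Euclidean distance in pixels between paired landmarks.

    Assumption: "precision" is read here as a mean point-to-point distance.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty landmark lists of equal length")
    distances = [
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(predicted, reference)
    ]
    return sum(distances) / len(distances)


# Hypothetical coordinates for four landmarks in one frame.
predicted_points = [(10.0, 20.0), (13.0, 24.0), (50.0, 60.0), (52.0, 60.0)]
reference_points = [(10.0, 20.0), (10.0, 20.0), (50.0, 62.0), (52.0, 60.0)]

error_px = mean_landmark_error(predicted_points, reference_points)
print(f"mean landmark error: {error_px:.2f} px")
```

With these toy values the individual distances are 0, 5, 2 and 0 pixels, so the mean is 1.75 pixels.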

Highlights

In current post-industrial societies, a large part of the working population relies on well-functioning communication skills.
We present, for the first time, a fully automatic glottis and vocal fold tissue segmentation procedure based on an extended version of the U-Net architecture, which provides single image segmentation.
The Dice Coefficient (DC) computed for the glottal area was used as the metric, since proper segmentation of the glottis is the most relevant outcome for subsequent voice analysis.
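
The Dice Coefficient named above measures the overlap between a predicted mask A and a reference mask B as DC = 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch in plain Python, assuming binary masks stored as nested lists. The function name and the toy masks are hypothetical and not from the paper, and treating two empty masks as perfect agreement (for example, a fully closed glottis in both) is an assumed convention.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixel labels


def dice_coefficient(pred: Mask, truth: Mask) -> float:
    """Dice Coefficient DC = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    size_pred = 0
    size_truth = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            p_on, t_on = bool(p), bool(t)
            intersection += p_on and t_on  # counts pixels set in both masks
            size_pred += p_on
            size_truth += t_on
    if size_pred + size_truth == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (size_pred + size_truth)


# Toy 3x4 masks: 5 pixels set in each, 4 of them shared.
predicted = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
reference = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]

dc = dice_coefficient(predicted, reference)
print(f"DC = {dc:.2f}")  # 2 * 4 / (5 + 5) = 0.80
```

A DC of 1 means the two masks coincide exactly and a DC of 0 means they share no pixel.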

Summary

Introduction

In current post-industrial societies, a large part of the working population relies on well-functioning communication skills. A prerequisite for efficient verbal communication is the production of a proper voice signal, which constitutes the carrier signal of speech. Any impairment of the voice production process has a direct impact on the perceivability of speech and thus affects the ability to communicate. A cross-sectional survey study carried out by Roy et al. in 2005 showed a lifetime prevalence of up to 29.9% for voice disorders that interfere with verbal communication [1]. Work-related absences due to voice disorders, as well as medical consultations, cause significant socioeconomic costs. The early diagnosis and effective therapy of voice disorders are of great importance.

Journal: PLOS ONE	Publication Date: Feb 10, 2020
Citations: 62	License type: CC BY 4.0
