Abstract

Pathological speech such as Oesophageal Speech (OS) is difficult to understand because it contains undesired artefacts and lacks the characteristics of normal healthy speech. Modern speech technologies and machine learning make it possible to transform pathological speech to improve its intelligibility and quality. We used a neural-network-based voice conversion method with the aim of improving the intelligibility and reducing the listening effort (LE) of four OS speakers of varying speaking proficiency. The novelty of this method is that the target is synthetic speech matched in duration with the source OS, instead of parallel aligned healthy speech. We evaluated the converted samples from this system using a collection of Automatic Speech Recognition (ASR) systems, an objective intelligibility metric (STOI) and a subjective test. The ASR evaluation shows that the proposed system achieved significantly better word recognition accuracy than both unprocessed OS and baseline systems that used aligned healthy speech as the target. STOI scores improved by at least 15%, indicating higher intelligibility for the proposed system than for unprocessed OS, and higher target similarity for the proposed system than for the baseline systems. The subjective test reveals a significant preference for the proposed system over unprocessed OS for all OS speakers except one, who was the least proficient OS speaker in the data set.
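The ASR evaluation is reported in terms of word recognition accuracy. As a minimal sketch of how such a score could be computed, the Python below assumes accuracy is defined as one minus the word error rate (word-level edit distance divided by the number of reference words). That definition, the function names and the example sentences are assumptions made for illustration; the abstract does not state the exact formula or scoring tool used.

```python
def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - WER. Can be negative if there are many insertions."""
    n_ref_words = len(reference.split())
    if n_ref_words == 0:
        raise ValueError("reference transcript must contain at least one word")
    return 1.0 - word_edit_distance(reference, hypothesis) / n_ref_words


if __name__ == "__main__":
    # Hypothetical transcript pair, not taken from the study's data.
    reference = "the cat sat on the mat"
    recognised = "the cat sat on mat"
    print(f"word accuracy: {word_accuracy(reference, recognised):.3f}")
```

Scoring each ASR transcript against its reference text this way would allow unprocessed OS, the baseline conversions and the proposed conversion to be compared on the same sentences.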

Highlights

  • We evaluated the outputs of our proposed enrichment system using three Automatic Speech Recognition (ASR) systems: the speech-to-text system from Microsoft Azure, accessed through the Python azure-cognitiveservices-speech library (ASR 1) [42]; the Elhuyar speech recognition system (ASR 2) [43]; and a Kaldi-based [36] system (ASR 3) developed in our laboratory and described in [44]

  • The input files to these ASR systems were the 100 single channel wav files sampled at


Summary

Introduction

Laryngectomy is the surgical procedure of removing the larynx [1]. In addition to several functional disorders and lifestyle changes [2], it results in the loss of the vocal folds and of the patient’s pre-surgery speech [3]. Producing oesophageal speech (OS) introduces acoustic artefacts [6] and reduces intelligibility [7,8], which affects communication, social activities and quality of life [2,9]. Compared with healthy speech (HS), OS is both less intelligible and more effortful to listen to, as shown by previous listening experiments [10,11] and by the acoustic characteristics and challenges of OS [12]. Prolonged exposure to effortful speech causes fatigue in listeners [13]. We aim to enrich OS by closing the OS-HS gaps in intelligibility, quality and listening effort (LE).
