Abstract

This paper describes a technique that generates speech acoustics from articulator movements. Our motivation is to help people who can no longer speak following laryngectomy, a procedure that is carried out tens of thousands of times per year in the Western world. Our method for sensing articulator movement, permanent magnet articulography (PMA), relies on small, unobtrusive magnets attached to the lips and tongue. Changes in the magnetic field caused by magnet movements are sensed and form the input to a process that is trained to estimate speech acoustics. In the experiments reported here, this “Direct Synthesis” technique is developed for normal speakers, with glued-on magnets, allowing us to train with parallel sensor and acoustic data. We describe three machine learning techniques for this task, based on Gaussian mixture models, deep neural networks, and recurrent neural networks (RNNs). We evaluate our techniques with objective acoustic distortion measures and subjective listening tests on spoken sentences read from novels (the CMU Arctic corpus). Our results show that the best-performing technique is a bidirectional RNN (BiRNN), which employs both past and future contexts to predict the acoustics from the sensor data. BiRNNs are not suitable for synthesis in real time, but fixed-lag RNNs give similar results and, because they look only a little way into the future, overcome this problem. Listening tests show that the speech produced by this method has a natural quality that preserves the identity of the speaker. Furthermore, we obtain up to 92% intelligibility on the challenging CMU Arctic material. To our knowledge, these are the best results obtained for a silent-speech system without a restricted vocabulary and with an unobtrusive device that delivers audio in close to real time. This work promises to lead to a technology that will truly give people whose larynx has been removed their voices back.
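To make the sensor-to-acoustics mapping concrete, the following is a minimal sketch of a bidirectional recurrent regressor that reads a sequence of PMA sensor frames and predicts one acoustic feature vector per frame. It is written in PyTorch with placeholder layer sizes and feature dimensions that are our own assumptions, not the configuration reported in the paper; a fixed-lag variant can be approximated by using a unidirectional network whose targets are delayed by a few frames, so only a short look-ahead is needed.

    import torch
    import torch.nn as nn

    class BiRNNMapper(nn.Module):
        """Illustrative sensor-to-acoustics regressor (not the paper's exact model):
        a bidirectional LSTM maps PMA sensor frames to acoustic feature frames."""
        def __init__(self, sensor_dim=9, acoustic_dim=25, hidden=128):
            super().__init__()
            self.rnn = nn.LSTM(sensor_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, acoustic_dim)  # per-frame regression

        def forward(self, x):          # x: (batch, frames, sensor_dim)
            h, _ = self.rnn(x)         # (batch, frames, 2 * hidden)
            return self.out(h)         # (batch, frames, acoustic_dim)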

Highlights

  • Silent speech refers to a form of spoken communication which does not depend on the acoustic signal from the speaker

  • The principle of a silent speech interface (SSI) is that the speech a person wishes to produce can be inferred from non-acoustic sources of information generated during speech articulation, such as the brain’s electrical activity [2], [3], the electrical activity produced by the articulator muscles [4]–[6], or the movement of the speech articulators [7]–[10]

  • In comparison with previous work, in this paper we carry out an extensive evaluation of the effect on the quality of the speech generated by the Gaussian mixture model (GMM) and deep neural network (DNN) mapping approaches when the following are used in the mapping: (i) segmental, contextual features computed by concatenating several permanent magnet articulography (PMA) samples to capture the articulator dynamics (illustrated in the sketch after this list), (ii) the maximum likelihood parameter generation (MLPG) algorithm [26], [27] to obtain smoother temporal trajectories for the predicted speech features, and (iii) conversion considering the global variance (GV) of the speech features, which has been shown to improve perceived quality in speech synthesis and voice conversion (VC) but has not been extensively investigated for articulatory-to-speech conversion
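As a concrete illustration of feature (i), the sketch below stacks each PMA sample with its left and right neighbours to form the segmental input vector. It uses plain NumPy; the context width of 5 frames is an assumption chosen for illustration, not the value used in the paper.

    import numpy as np

    def stack_context(pma, width=5):
        """Concatenate each PMA frame with its `width` left and right
        neighbours (edge frames are repeated) to capture articulator dynamics.
        pma: (frames, channels) array -> (frames, (2*width + 1) * channels)."""
        padded = np.pad(pma, ((width, width), (0, 0)), mode="edge")
        return np.concatenate(
            [padded[i:i + len(pma)] for i in range(2 * width + 1)], axis=1)

The resulting segmental vectors are what the GMM or DNN regressor would consume in place of single sensor frames.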


Summary

INTRODUCTION

Silent speech refers to a form of spoken communication which does not depend on the acoustic signal from the speaker. In comparison with previous work, in this paper we carry out an extensive evaluation of the effect on the quality of the speech generated by the GMM and DNN mapping approaches when the following are used in the mapping: (i) segmental, contextual features computed by concatenating several PMA samples to capture the articulator dynamics, (ii) the maximum likelihood parameter generation (MLPG) algorithm [26], [27] to obtain smoother temporal trajectories for the predicted speech features, and (iii) conversion considering the global variance (GV) of the speech features, which has been shown to improve perceived quality in speech synthesis and voice conversion (VC) but has not been extensively investigated for articulatory-to-speech conversion.
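To illustrate feature (ii), the sketch below gives a single-dimension version of MLPG: given frame-wise Gaussian predictions for a static feature and its delta, it solves the weighted least-squares system W^T Σ^{-1} W c = W^T Σ^{-1} μ for the smoothed static trajectory. The centred-difference delta window and the NumPy implementation are our own simplifying assumptions; the exact windows and any GV post-processing used in the paper may differ.

    import numpy as np

    def mlpg_1d(mu_static, mu_delta, var_static, var_delta):
        """Maximum likelihood parameter generation for one feature dimension.
        Inputs are (T,) arrays of predicted means and diagonal variances for the
        static and delta streams; returns the smoothed static trajectory (T,)."""
        T = len(mu_static)
        # Window matrix W mapping statics to [statics; deltas],
        # with delta_t = (c_{t+1} - c_{t-1}) / 2 at interior frames.
        W = np.zeros((2 * T, T))
        W[:T, :] = np.eye(T)
        for t in range(T):
            if t > 0:
                W[T + t, t - 1] = -0.5
            if t < T - 1:
                W[T + t, t + 1] = 0.5
        mu = np.concatenate([mu_static, mu_delta])
        prec = 1.0 / np.concatenate([var_static, var_delta])  # diagonal precision
        A = W.T @ (prec[:, None] * W)   # W^T Sigma^-1 W
        b = W.T @ (prec * mu)           # W^T Sigma^-1 mu
        return np.linalg.solve(A, b)    # ML static trajectory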

STATISTICAL ARTICULATORY-TO-SPEECH MAPPING
Conventional GMM-based mapping technique
DNN-based conversion
Mapping using RNNs
EXPERIMENTS
Evaluation setup
Results
Findings
CONCLUSIONS