Abstract

An electrolarynx (EL) is a widely used device that mechanically generates excitation signals, making it possible for laryngectomees to produce EL speech without vocal fold vibrations. Although EL speech sounds relatively intelligible, it is significantly less natural than normal speech owing to its mechanical excitation signals. To address this issue, a statistical voice conversion (VC) technique based on Gaussian mixture models (GMMs) has been applied to EL speech enhancement. In this technique, input EL speech is converted into target normal speech by mapping spectral features of the EL speech to spectral and excitation parameters of normal speech using GMMs. Although this technique significantly improves the naturalness of EL speech, the enhanced EL speech is still far from the target normal speech. To improve the performance of statistical EL speech enhancement, in this paper we propose an EL-to-speech conversion method based on CLDNNs, which consist of convolutional layers, long short-term memory (LSTM) recurrent layers, and fully connected deep neural network layers. Three CLDNNs are trained: one to convert EL speech spectral features into spectral and band-aperiodicity parameters, one to convert them into unvoiced/voiced symbols, and one to convert them into continuous $F_{0}$ patterns. The experimental results demonstrate that the proposed method significantly outperforms the conventional method in terms of both objective evaluation metrics and subjective evaluation scores.
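To make the three-network CLDNN setup concrete, the following is a minimal PyTorch sketch of the convolutional-LSTM-fully-connected stack described above. It is an illustration under stated assumptions, not the paper's implementation: the layer sizes, kernel width, feature dimensions, and the variable names `net_spec`, `net_uv`, and `net_f0` are all hypothetical choices made for this example.

```python
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    """Minimal CLDNN: a 1-D convolutional front-end over the frame
    sequence, LSTM recurrent layers, and fully connected output layers.
    All sizes here are illustrative assumptions."""

    def __init__(self, in_dim, out_dim, conv_channels=64,
                 lstm_units=256, num_lstm_layers=2):
        super().__init__()
        # Convolutional layers over the time axis of the feature sequence
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Long short-term memory recurrent layers
        self.lstm = nn.LSTM(conv_channels, lstm_units,
                            num_layers=num_lstm_layers, batch_first=True)
        # Fully connected deep neural network layers
        self.fc = nn.Sequential(
            nn.Linear(lstm_units, lstm_units),
            nn.ReLU(),
            nn.Linear(lstm_units, out_dim),
        )

    def forward(self, x):
        # x: (batch, frames, in_dim) EL-speech spectral features
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        return self.fc(h)

# Three networks, one per target stream (all dimensions assumed):
mcep_dim = 25                       # EL spectral feature dimension
net_spec = CLDNN(mcep_dim, 25 + 5)  # spectral + band-aperiodicity params
net_uv   = CLDNN(mcep_dim, 1)       # unvoiced/voiced symbol per frame
net_f0   = CLDNN(mcep_dim, 1)       # continuous F0 pattern
```

At conversion time, one would run the same EL spectral feature sequence through all three networks and combine the predicted spectral, aperiodicity, voicing, and $F_{0}$ streams in a vocoder to synthesize the enhanced speech; the specific vocoder and training losses are not specified by this sketch.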
