Abstract

Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently, without exerting much mutual influence. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop machine speech chain model based on deep learning. The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improves performance over separate systems trained only on labeled data.
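To make the closed-loop training concrete, the sketch below shows one unsupervised update in PyTorch-style code: ASR transcribes unlabeled speech and TTS tries to reconstruct the original features, while TTS synthesizes speech for unlabeled text and ASR tries to recover the transcription. The `asr` and `tts` modules and their `greedy_decode` and `infer` methods are hypothetical placeholders, not names from the paper; this is a minimal illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def speech_chain_step(asr, tts, speech_feats, text_ids, alpha=1.0, beta=1.0):
    """One unsupervised speech-chain step on unpaired speech and unpaired text.

    asr(features) -> token logits (batch, time, vocab); asr.greedy_decode -> ids.
    tts(token_ids) -> reconstructed features; tts.infer synthesizes from text.
    (All module and method names here are illustrative assumptions.)
    """
    # ASR -> TTS cycle: transcribe unlabeled speech, then resynthesize it.
    with torch.no_grad():
        pseudo_text = asr.greedy_decode(speech_feats)   # discrete hypothesis
    recon_feats = tts(pseudo_text)                      # differentiable w.r.t. TTS
    loss_tts = F.l1_loss(recon_feats, speech_feats)     # feature reconstruction loss

    # TTS -> ASR cycle: synthesize speech for unlabeled text, then re-transcribe.
    with torch.no_grad():
        pseudo_speech = tts.infer(text_ids)
    logits = asr(pseudo_speech)                         # differentiable w.r.t. ASR
    loss_asr = F.cross_entropy(logits.transpose(1, 2), text_ids)

    return alpha * loss_asr + beta * loss_tts
```

Because each hypothesis is decoded without gradients, the two losses update ASR and TTS separately; labeled pairs are still trained with the usual supervised losses.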

Highlights

  • Speech chain, a concept introduced by Denes et al. [1], describes the basic mechanism involved in speech communication when a spoken message travels from the speaker's mind to the listener's mind (Fig. 1)

  • The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data

  • In the subjective evaluation, we assessed the quality of the synthesized speech with a mean opinion score (MOS) test on a five-point scale (5: very good; 1: very poor)
Summary

INTRODUCTION

Speech chain, a concept introduced by Denes et al. [1], describes the basic mechanism involved in speech communication when a spoken message travels from the speaker's mind to the listener's mind (Fig. 1). Due to deep learning's representational power, many complicated hand-engineered models have been simplified by letting deep neural networks (DNNs) learn their way from the input to the output space. With this newly emerging approach to sequence-to-sequence mapping tasks, a model with a common architecture can directly learn the mapping between variable-length representations of different modalities: text-to-text sequences [21], [22], speech-to-text sequences [23], [24], text-to-speech sequences [25], image-to-text sequences [26], etc. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning, constructing a sequence-to-sequence model for both the ASR and TTS tasks as well as a loop connection between these two processes. Our contributions in this paper include: 1) a basic machine speech chain that integrates ASR and TTS and performs a single-speaker task; 2) a multi-speaker speech chain with a speaker-embedding network for handling speech with different voice characteristics; and 3) a machine speech chain with a straight-through estimator that allows an end-to-end feedback loss through discrete units or subwords (see the sketch below).
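Contribution 3 relies on a straight-through estimator (STE) to let the TTS reconstruction loss backpropagate through ASR's discrete output. As a minimal, generic illustration of that technique (an assumed PyTorch-style sketch, not the paper's exact formulation), the forward pass emits a hard one-hot vector while the backward pass uses the softmax gradient:

```python
import torch

def straight_through_argmax(logits: torch.Tensor) -> torch.Tensor:
    """Hard one-hot in the forward pass, soft gradients in the backward pass."""
    probs = torch.softmax(logits, dim=-1)
    index = probs.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(probs).scatter_(-1, index, 1.0)
    # (hard - probs).detach() + probs evaluates to `hard` in the forward pass,
    # but its gradient is that of `probs`, so the discrete choice stays trainable.
    return (hard - probs).detach() + probs
```

Feeding such one-hot vectors from ASR into TTS, rather than detached token ids, would allow an end-to-end feedback loss of the kind described in the section "END-TO-END FEEDBACK LOSS ON SPEECH CHAIN".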

RELATED WORKS
Overview
Sequence-to-Sequence Model for ASR
Sequence-to-Sequence Model for TTS
Experiment on Single-Speaker Task
Discussion
MACHINE SPEECH CHAIN FRAMEWORK WITH SPEAKER ADAPTATION
Speaker Recognition and Embedding
Sequence-to-Sequence TTS With One-Shot Speaker Adaptation
Experiment on Multi-Speaker Task
END-TO-END FEEDBACK LOSS ON SPEECH CHAIN
End-to-End Feedback Loss
Experiment on Multi-Speaker Task in Supervised Settings
CONCLUSION