Abstract
Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without exerting much mutual influence. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop machine speech chain model based on deep learning. The sequence-to-sequence model in closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improved performance over that from separate systems that were only trained with labeled data.
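The abstract describes a dual-reconstruction mechanism: unlabeled speech is transcribed by ASR and resynthesized by TTS, while unlabeled text is synthesized by TTS and re-transcribed by ASR. The following is a minimal, hypothetical sketch of that training loop, not the authors' code: the toy `ASR`/`TTS` modules, fixed sequence lengths, greedy decoding, and loss weighting are all simplifying assumptions made for illustration.

```python
# Hypothetical sketch of closed-loop speech-chain training:
#   labeled batch   -> supervised ASR and TTS losses
#   unlabeled speech -> ASR -> text -> TTS -> reconstruction loss (trains TTS)
#   unlabeled text   -> TTS -> speech -> ASR -> reconstruction loss (trains ASR)
import torch
import torch.nn as nn

class ASR(nn.Module):
    """Toy stand-in: speech features (B, T, F) -> token logits (B, L, V)."""
    def __init__(self, feat_dim=80, vocab=32, hidden=64, max_len=5):
        super().__init__()
        self.enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
        self.max_len = max_len

    def forward(self, speech):
        h, _ = self.enc(speech)
        return self.out(h[:, :self.max_len, :])  # crude fixed-length "decoder"

class TTS(nn.Module):
    """Toy stand-in: token ids (B, L) -> speech features (B, k*L, F)."""
    def __init__(self, vocab=32, feat_dim=80, hidden=64, frames_per_token=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.dec = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)
        self.k = frames_per_token

    def forward(self, tokens):
        h, _ = self.dec(self.emb(tokens).repeat_interleave(self.k, dim=1))
        return self.out(h)

asr, tts = ASR(), TTS()
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)
ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

def speech_chain_step(paired, speech_only, text_only):
    """One combined update over a labeled batch and two unlabeled batches."""
    x, y = paired                                  # speech (B,T,F), text (B,L)
    loss = ce(asr(x).transpose(1, 2), y)           # supervised ASR loss
    loss = loss + l1(tts(y), x)                    # supervised TTS loss
    # Unlabeled speech: transcribe, resynthesize, compare with the original.
    y_hat = asr(speech_only).argmax(-1)            # greedy decode is non-differentiable,
    loss = loss + l1(tts(y_hat), speech_only)      # so this path updates TTS only
    # Unlabeled text: synthesize, re-transcribe, compare with the original.
    x_hat = tts(text_only).detach()                # detached, so this path updates ASR only
    loss = loss + ce(asr(x_hat).transpose(1, 2), text_only)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy shapes: 20 speech frames per 5-token utterance, 32-token vocabulary.
B, T, F, L, V = 2, 20, 80, 5, 32
paired = (torch.randn(B, T, F), torch.randint(V, (B, L)))
print(speech_chain_step(paired, torch.randn(B, T, F), torch.randint(V, (B, L))))
```

The `.argmax()` and `.detach()` calls reflect the design choice the abstract implies: each unlabeled path trains only the model that performs the reconstruction, since the discrete transcription step blocks gradients (the paper's straight-through variant, sketched later, relaxes exactly this restriction).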
Highlights
Speech chain, a concept introduced by Denes et al. [1], describes the basic mechanism involved in speech communication when a spoken message travels from the speaker's mind to the listener's mind (Fig. 1).
The sequence-to-sequence model in closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data
In the subjective evaluation, the quality of the synthesized speech was assessed with a mean opinion score (MOS) test on a five-point scale (5: very good; 1: very poor).
Summary
Speech chain, a concept introduced by Denes et al. [1], describes the basic mechanism involved in speech communication when a spoken message travels from the speaker's mind to the listener's mind (Fig. 1). Due to deep learning's representational power, many complicated hand-engineered models have been simplified by letting deep neural networks (DNNs) learn their way from the input to the output space. With this newly emerging approach to sequence-to-sequence mapping tasks, a model with a common architecture can directly learn the mapping between variable-length representations of different modalities: text-to-text sequences [21], [22], speech-to-text sequences [23], [24], text-to-speech sequences [25], image-to-text sequences [26], etc. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning, constructing a sequence-to-sequence model for both the ASR and TTS tasks as well as a loop connection between these two processes. Our contributions in this paper include: 1) a basic machine speech chain that integrates ASR and TTS and performs a single-speaker task; 2) a multi-speaker speech chain with a speaker-embedding network for handling speech with different voice characteristics; and 3) a machine speech chain with a straight-through estimator that allows an end-to-end feedback loss through discrete units or subwords (see the sketch below).
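The third contribution relies on a straight-through estimator (STE) to pass gradients through ASR's discrete token output, so the TTS reconstruction loss can reach the ASR parameters end to end. Below is a minimal, hypothetical sketch of the STE trick itself; the function name and the downstream embedding are illustrative, not the authors' code.

```python
# Straight-through estimator: discrete (one-hot argmax) in the forward pass,
# but differentiable via the softmax probabilities in the backward pass.
import torch

def straight_through_onehot(logits: torch.Tensor) -> torch.Tensor:
    """logits (..., vocab) -> hard one-hot tensor with a soft gradient."""
    probs = torch.softmax(logits, dim=-1)
    hard = torch.zeros_like(probs).scatter_(-1, probs.argmax(-1, keepdim=True), 1.0)
    # Value equals `hard`, but the gradient w.r.t. `logits` is that of `probs`.
    return (hard - probs).detach() + probs

# Example: feed the one-hot output into a downstream (e.g. TTS) input embedding
# by matrix product; gradients still flow back into the logits.
logits = torch.randn(2, 5, 32, requires_grad=True)  # (batch, length, vocab)
emb = torch.randn(32, 64)                           # hypothetical TTS embedding table
out = straight_through_onehot(logits) @ emb         # discrete forward, soft backward
out.sum().backward()
print(logits.grad.shape)                            # torch.Size([2, 5, 32])
```

With this estimator in place of a hard `.argmax()`, the speech-to-text-to-speech loop no longer has to stop gradients at the transcription step, which is what enables the end-to-end feedback loss through discrete units or subwords.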