Abstract

An End-Of-Turn Detection Module (EOTD-M) is an essential component of automatic Spoken Dialogue Systems. The capability of correctly detecting whether a user's utterance has ended improves the accuracy in interpreting the meaning of the message and decreases response latency. Usually, in dialogue systems, an EOTD-M is coupled with an Automatic Speech Recognition Module (ASR-M) to transmit complete utterances to the Natural Language Understanding unit. Mistakes in the ASR-M transcription can have a strong effect on the performance of the EOTD-M. The actual extent of this effect depends on the particular combination of ASR-M transcription errors and the sentence featurization techniques implemented as part of the EOTD-M. In this paper we investigate this important relationship for an EOTD-M based on semantic information and particular characteristics of the speakers (speech profiles). We introduce an Automatic Speech Recognition Simulator (ASR-SIM) that models different types of semantic mistakes in the ASR-M transcription as well as different speech profiles. We use the simulator to evaluate the sensitivity to ASR-M mistakes of a Long Short-Term Memory network classifier trained for EOTD with different featurization techniques. Our experiments reveal the different ways in which the performance of the model is influenced by the ASR-M errors. We corroborate that not only is the ASR-SIM useful for estimating the performance of an EOTD-M in customized noisy scenarios, but it can also be used to generate training datasets with the expected error rates of real working conditions, which leads to better performance.
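The abstract describes a simulator that injects controlled transcription errors into clean text. The paper does not give the ASR-SIM's internals here, so the following is only a minimal sketch of the general idea: word-level substitutions, deletions, and insertions applied at configurable rates (all function names, rates, and the toy confusion vocabulary are illustrative assumptions, not the authors' implementation).

```python
import random

def simulate_asr_errors(words, sub_rate=0.1, del_rate=0.05, ins_rate=0.05,
                        vocab=("the", "a", "uh"), rng=None):
    """Inject word-level errors into a clean transcription to mimic
    ASR-M mistakes. Rates and the confusion vocabulary are assumptions
    for illustration; a real ASR-SIM would model semantic confusions
    and speaker-dependent speech profiles."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    noisy = []
    for w in words:
        r = rng.random()
        if r < del_rate:
            continue                          # word dropped by the recognizer
        if r < del_rate + sub_rate:
            noisy.append(rng.choice(vocab))   # word misrecognized
        else:
            noisy.append(w)                   # word transcribed correctly
        if rng.random() < ins_rate:
            noisy.append(rng.choice(vocab))   # spurious word inserted
    return noisy
```

Sweeping the three rates lets one generate evaluation (or training) corpora at the error levels expected in deployment, which is the use case the paper argues for.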

Highlights

  • Implementing Spoken Dialogue Systems involves solving several difficult machine learning problems

  • Mistakes in the Automatic Speech Recognition Module (ASR-M) of a dialogue system based on the architecture illustrated in Fig. 1(a) will have an effect on the performance of the End-Of-Turn Detection Module (EOTD-M) and Natural Language Understanding Module (NLU-M)

  • As it is not possible to generate all possible types of noise that an Automatic Speech Recognition Module (ASR-M) can receive, our goal is to introduce an Automatic Speech Recognition Simulator (ASR-SIM) that can be controlled in such a way that the transcribed data exhibits different types and rates of artifacts

Introduction

Implementing Spoken Dialogue Systems involves solving several difficult machine learning problems. Mistakes in the Automatic Speech Recognition Module (ASR-M) of a dialogue system based on the architecture illustrated in Fig. 1(a) will have an effect on the performance of the End-Of-Turn Detection Module (EOTD-M) and the Natural Language Understanding Module (NLU-M). Because different methods of converting words into numerical information (featurization) exploit different features of speech, a given combination of classifier and featurization technique may be sensitive to some types of errors and insensitive to others. Investigating this relationship is complicated by the fact that the particular errors an ASR-M produces depend on the features of human speech, ambient noise, and the performance of the ASR-M itself. A detected end of turn triggers the evaluation of the sentence or sentences received by the NLU-M.
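The claim that featurization techniques differ in their sensitivity to ASR errors can be made concrete with a toy comparison, not taken from the paper: under a single hypothetical misrecognition ("still" → "stil"), word-level set features lose a whole token while character n-gram features still overlap heavily with the clean sentence. The sentences, the error, and the Jaccard measure are all illustrative assumptions.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams (n=3 assumed for illustration)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

clean = "are you still there"
noisy = "are you stil there"   # hypothetical ASR-M substitution error

# Word-level features lose the whole misrecognized token...
word_sim = jaccard(set(clean.split()), set(noisy.split()))
# ...while character trigrams still share most of their mass.
char_sim = jaccard(char_ngrams(clean), char_ngrams(noisy))
```

Here `char_sim` exceeds `word_sim`, illustrating why the impact of a given ASR error rate cannot be judged independently of the featurization feeding the EOTD classifier.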
