Abstract

An End-Of-Turn Detection Module (EOTD-M) is an essential component of automatic Spoken Dialogue Systems. The capability of correctly detecting whether a user's utterance has ended improves the accuracy in interpreting the meaning of the message and decreases response latency. Usually, in dialogue systems, an EOTD-M is coupled with an Automatic Speech Recognition Module (ASR-M) to transmit complete utterances to the Natural Language Understanding unit. Mistakes in the ASR-M transcription can have a strong effect on the performance of the EOTD-M. The actual extent of this effect depends on the particular combination of ASR-M transcription errors and the sentence featurization techniques implemented as part of the EOTD-M. In this paper we investigate this important relationship for an EOTD-M based on semantic information and particular characteristics of the speakers (speech profiles). We introduce an Automatic Speech Recognition Simulator (ASR-SIM) that models different types of semantic mistakes in the ASR-M transcription as well as different speech profiles. We use the simulator to evaluate the sensitivity to ASR-M mistakes of a Long Short-Term Memory network classifier trained for EOTD with different featurization techniques. Our experiments reveal the different ways in which the performance of the model is influenced by the ASR-M errors. We corroborate that not only is the ASR-SIM useful for estimating the performance of an EOTD-M in customized noisy scenarios, but it can also be used to generate training datasets with the expected error rates of real working conditions, which leads to better performance.
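The abstract describes a simulator that injects controlled transcription errors into clean text. The paper does not give the ASR-SIM's internals here, so the following is only a minimal sketch of the general idea: word-level substitutions, deletions, and insertions applied at configurable rates (all function names, rates, and the toy confusion vocabulary are illustrative assumptions, not the authors' implementation).

```python
import random

def simulate_asr_errors(words, sub_rate=0.1, del_rate=0.05, ins_rate=0.05,
                        vocab=("the", "a", "uh"), rng=None):
    """Inject word-level errors into a clean transcription to mimic
    ASR-M mistakes. Rates and the confusion vocabulary are assumptions
    for illustration; a real ASR-SIM would model semantic confusions
    and speaker-dependent speech profiles."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    noisy = []
    for w in words:
        r = rng.random()
        if r < del_rate:
            continue                          # word dropped by the recognizer
        if r < del_rate + sub_rate:
            noisy.append(rng.choice(vocab))   # word misrecognized
        else:
            noisy.append(w)                   # word transcribed correctly
        if rng.random() < ins_rate:
            noisy.append(rng.choice(vocab))   # spurious word inserted
    return noisy
```

Sweeping the three rates lets one generate evaluation (or training) corpora at the error levels expected in deployment, which is the use case the paper argues for.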

Highlights

  • Implementing Spoken Dialogue Systems involves solving several difficult machine learning problems

  • Mistakes in the Automatic Speech Recognition Module (ASR-M) of a dialogue system based on the architecture illustrated in Fig. 1(a) will have an effect on the performance of the End-Of-Turn Detection Module (EOTD-M) and Natural Language Understanding Module (NLU-M)

  • As it is not possible to generate all possible types of noise that an Automatic Speech Recognition Module (ASR-M) can receive, our goal is to introduce an Automatic Speech Recognition Simulator (ASR-SIM) that can be controlled in such a way that the transcribed data exhibits different types and rates of artifacts

Introduction

Implementing Spoken Dialogue Systems involves solving several difficult machine learning problems. Mistakes in the Automatic Speech Recognition Module (ASR-M) of a dialogue system based on the architecture illustrated in Fig. 1(a) will have an effect on the performance of the End-Of-Turn Detection Module (EOTD-M) and the Natural Language Understanding Module (NLU-M). Because different methods of converting words into numerical information (featurization) exploit different features of speech, a given combination of classifier and featurization technique may be sensitive to some types of errors and insensitive to others. Investigating this relationship is complicated by the fact that the particular errors an ASR-M produces depend on the features of human speech, ambient noise, and the performance of the ASR-M itself. A detected end of turn triggers the evaluation of the sentence or sentences received by the NLU-M.
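The claim that featurization techniques differ in their sensitivity to ASR errors can be made concrete with a toy comparison, not taken from the paper: under a single hypothetical misrecognition ("still" → "stil"), word-level set features lose a whole token while character n-gram features still overlap heavily with the clean sentence. The sentences, the error, and the Jaccard measure are all illustrative assumptions.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams (n=3 assumed for illustration)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

clean = "are you still there"
noisy = "are you stil there"   # hypothetical ASR-M substitution error

# Word-level features lose the whole misrecognized token...
word_sim = jaccard(set(clean.split()), set(noisy.split()))
# ...while character trigrams still share most of their mass.
char_sim = jaccard(char_ngrams(clean), char_ngrams(noisy))
```

Here `char_sim` exceeds `word_sim`, illustrating why the impact of a given ASR error rate cannot be judged independently of the featurization feeding the EOTD classifier.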
