Abstract

With the exponential growth in computing power and progress in speech recognition technology, spoken dialog systems (SDSs), with which a user interacts through natural speech, have been widely used in human-computer interaction. However, error-prone automatic speech recognition (ASR) results usually lead to inappropriate semantic interpretation, so miscommunication occurs easily. This paper presents an approach to error-aware dialog state (DS) detection for robust miscommunication handling in an SDS. Two kinds of miscommunication are considered in this study: non-understanding (Non-U) and misunderstanding (Mis-U). First, understanding evidence (UE), derived from the recognition confidence, is adopted for Non-U detection, followed by Non-U recovery. For Mis-U, where the recognized sentence contains uncertain words, the partial sentences obtained by removing potentially misrecognized words from the input utterance are organized, based on regular expressions, into a tree structure; this tolerates the deletion or rejection of keywords caused by misrecognition and supports Mis-U DS modeling. Latent semantic analysis (LSA) is then employed on the verified words and their n-grams for DS detection, covering both Mis-U DSs and predefined Base DSs. Finally, n-grams over the dialog history are employed to find the most likely DS for the SDS. Several experiments were performed on a dialog corpus for a restaurant reservation task. The experimental results show that the proposed approach achieved promising performance for Non-U recovery and Mis-U repair, as well as a satisfactory task success rate.
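
As a concrete illustration of the pipeline sketched above, the following Python snippet flags an utterance as Non-U or Mis-U from word-level recognition confidences and, in the Mis-U case, generates the partial sentences obtained by removing potentially misrecognized words. The thresholds, names, and data are illustrative assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch of confidence-based miscommunication handling.
# Thresholds and data structures are assumptions, not the paper's method.
from itertools import combinations

# Hypothetical confidence thresholds: below T_REJECT the whole utterance is
# treated as a non-understanding (Non-U); words below T_ACCEPT are
# "uncertain" and may signal a misunderstanding (Mis-U).
T_REJECT = 0.30
T_ACCEPT = 0.70

def classify_understanding(words, confidences):
    """Return ('non_u' | 'mis_u' | 'ok', indices of uncertain words)."""
    avg = sum(confidences) / len(confidences)
    if avg < T_REJECT:
        return "non_u", []          # reject and ask the user to rephrase
    uncertain = [i for i, c in enumerate(confidences) if c < T_ACCEPT]
    return ("mis_u", uncertain) if uncertain else ("ok", [])

def partial_sentences(words, uncertain):
    """Generate partial sentences by removing subsets of uncertain words,
    tolerating deletion/rejection of potentially misrecognized keywords."""
    results = []
    for r in range(len(uncertain) + 1):
        for removed in combinations(uncertain, r):
            results.append([w for i, w in enumerate(words) if i not in removed])
    return results

words = ["book", "a", "table", "for", "too", "people"]  # "too" misrecognized
confs = [0.92, 0.88, 0.95, 0.90, 0.41, 0.85]
state, uncertain = classify_understanding(words, confs)
if state == "mis_u":
    for s in partial_sentences(words, uncertain):
        print(" ".join(s))
```

In the proposed approach these partial sentences are organized into a tree structure via regular expressions; the flat enumeration above only illustrates the candidate-generation step.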

Highlights

  • An error-aware dialog state (DS) detection approach is proposed for robust miscommunication handling in spoken dialog systems

  • Understanding evidence (UE) type determination is performed for Non-U and Mis-U detection, followed by partial sentence generation

  • The sentence cluster number and the dimensionality in latent semantic analysis (LSA) are determined by the prediction risk (see the sketch after this list)
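
The prediction-risk criterion can be illustrated with a small model-selection sketch. The snippet below selects the LSA dimensionality from a candidate grid; the sentence-cluster number can be chosen analogously. The toy sentences, labels, grid, and the use of cross-validated classification error as a proxy for prediction risk are all assumptions for exposition, not the paper's corpus or risk formula.

```python
# Hedged sketch: choosing the LSA dimensionality by (approximate) prediction
# risk, estimated here as cross-validated classification error on toy
# dialog-state labels. Data, grid, and classifier are illustrative.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

sentences = [  # toy utterances standing in for the dialog corpus
    "book a table for two tonight", "reserve a table for four tomorrow",
    "i want to book a seat for dinner", "table for three at seven please",
    "can i reserve a table for five", "please book a table for six",
    "what time do you open", "when does the restaurant close",
    "are you open on sunday", "what are your opening hours",
    "do you open at noon", "is the restaurant open on monday",
]
labels = ["reserve"] * 6 + ["hours"] * 6  # toy dialog-state labels

best_dim, best_risk = None, float("inf")
for dim in (2, 4, 6):  # candidate LSA dimensionalities
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # words and their n-grams
        TruncatedSVD(n_components=dim),       # LSA projection
        LogisticRegression(max_iter=1000),
    )
    risk = 1.0 - cross_val_score(model, sentences, labels, cv=3).mean()
    if risk < best_risk:
        best_dim, best_risk = dim, risk
print(f"selected dimensionality: {best_dim} (estimated risk {best_risk:.3f})")
```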

Introduction

Voice-driven human-computer interaction has benefited greatly from steady improvements in the underlying speech technologies, such as speech recognition, speech synthesis, natural language understanding, and machine learning [1]. Spoken dialog systems (SDSs) are expected to enable efficient and intuitive communication between humans and computers [2] and to help users accomplish their goals through spoken language. This is done by recognizing the spoken utterance with automatic speech recognition (ASR) and then mapping the recognized word sequence to its semantic meaning with spoken language understanding (SLU). Because ASR and SLU errors are commonly encountered, dialog state tracking (DST) in the presence of such errors may lead to misunderstanding of the user's intention. Several methods have been proposed to handle ASR and SLU errors and improve performance. As one of the prominent human-computer interaction research areas, SDSs have been applied to a wide range of domains.
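
The ASR-to-SLU mapping, and how a single misrecognized word can corrupt the resulting dialog state, can be made concrete with a minimal sketch. The rule-based slot extractor below is a hypothetical toy, not the paper's SLU component; the slot names and patterns are assumptions chosen to match the restaurant reservation task.

```python
# Minimal illustration (assumed, not from the paper) of how an ASR error
# propagates through spoken language understanding (SLU) into the dialog state.
import re

def slu(recognized: str) -> dict:
    """Toy SLU: map a recognized word sequence to a slot-value frame
    using regular expressions."""
    frame = {}
    m = re.search(r"for (one|two|three|four|five)\b", recognized)
    if m:
        frame["party_size"] = m.group(1)
    m = re.search(r"at \d{1,2}(?:\s*(?:am|pm))?", recognized)
    if m:
        frame["time"] = m.group(0)[3:]
    return frame

print(slu("book a table for two at 7 pm"))  # correct ASR -> both slots filled
print(slu("book a table for too at 7 pm"))  # misrecognition drops party_size
```

A single substitution ("two" -> "too") silently deletes a keyword from the semantic frame, which is exactly the kind of Mis-U the proposed partial-sentence tree and error-aware DS detection are designed to repair.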
