Abstract

Spoken language understanding (SLU) in human-machine conversational systems is the process of interpreting the semantic meaning conveyed by a user's spoken utterance. Traditional SLU approaches transform the word string transcribed by an automatic speech recognition (ASR) system into a semantic label that determines the machine's subsequent response. However, the robustness of SLU results can suffer in the context of a human-machine conversation-based language learning system due to the presence of ambient noise, heavily accented pronunciation, ungrammatical utterances, etc. To address these issues, this paper proposes an end-to-end (E2E) modeling approach for SLU and evaluates the semantic labeling performance of bidirectional LSTM recurrent neural networks (BLSTM-RNNs) with input at three different levels: acoustic (filterbank features), phonetic (subphone posteriorgrams), and lexical (ASR hypotheses). Experimental results for spoken responses collected in a dialog application designed for English learners to practice job interviewing skills show that multi-level BLSTM-RNNs can exploit complementary information from the three levels to improve semantic labeling performance. An analysis of results on out-of-vocabulary (OOV) utterances, which are common in conversation-based dialog systems, further indicates that subphone posteriorgrams outperform ASR hypotheses on such inputs and that incorporating lower-level features for semantic labeling can improve the final SLU performance.
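
For illustration, a minimal sketch of how such a multi-level BLSTM labeler might be wired up is given below. This is an assumption-laden sketch, not the authors' implementation: it assumes PyTorch, late fusion by concatenating one utterance encoding per level, and illustrative feature dimensions (40-dim filterbanks, 120-dim subphone posteriorgrams, 300-dim word embeddings for the ASR hypothesis); none of these details are specified in the abstract.

# Hypothetical sketch of a multi-level BLSTM semantic labeler.
# All dimensions, layer sizes, and the late-fusion strategy are
# illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn

class MultiLevelBLSTM(nn.Module):
    def __init__(self, acoustic_dim=40, phonetic_dim=120, lexical_dim=300,
                 hidden=128, num_labels=10):
        super().__init__()
        # One BLSTM encoder per input level.
        self.acoustic = nn.LSTM(acoustic_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.phonetic = nn.LSTM(phonetic_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.lexical = nn.LSTM(lexical_dim, hidden, batch_first=True,
                               bidirectional=True)
        # Late fusion: concatenate the three utterance encodings,
        # then classify the utterance into one of the semantic labels.
        self.classifier = nn.Linear(3 * 2 * hidden, num_labels)

    @staticmethod
    def _encode(blstm, x):
        # x: (batch, time, feat_dim). Use the final hidden states of the
        # forward and backward directions as a fixed-size utterance encoding.
        _, (h, _) = blstm(x)                    # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hidden)

    def forward(self, fbank, posteriorgram, word_embeddings):
        fused = torch.cat([self._encode(self.acoustic, fbank),
                           self._encode(self.phonetic, posteriorgram),
                           self._encode(self.lexical, word_embeddings)],
                          dim=-1)
        return self.classifier(fused)           # (batch, num_labels) logits

# Usage with random stand-in inputs (batch of 8 utterances):
model = MultiLevelBLSTM()
logits = model(torch.randn(8, 200, 40),    # 200 acoustic frames
               torch.randn(8, 200, 120),   # subphone posteriorgrams per frame
               torch.randn(8, 30, 300))    # embeddings of 30 hypothesized words

Each level keeps its own encoder so the model can weight the streams independently; other fusion schemes (e.g., hierarchical or attention-based fusion) would change only how the per-level encodings are combined before the classifier.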
