Abstract

Spoken language understanding tasks usually rely on pipelines involving complex processing blocks such as voice activity detection, speaker diarization and automatic speech recognition (ASR). We propose a novel framework for predicting utterance-level labels directly from speech features, removing the dependency on first generating transcripts and enabling transcription-free behavioral coding. Our classifier uses a pretrained Speech-2-Vector encoder as a bottleneck to generate word-level representations from speech features. This pretrained encoder learns to encode the speech features for a word using an objective similar to Word2Vec. Our proposed approach uses only speech features and word segmentation information to predict utterance-level target labels. We show that our model achieves results competitive with state-of-the-art approaches that use transcribed text for the task of predicting psychotherapy-relevant behavior codes.
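
As a minimal sketch of this bottleneck idea (assuming a PyTorch implementation; the module name `Speech2VecEncoder`, the feature dimensions, and the GRU choice below are illustrative assumptions, not the paper's code), a variable-length sequence of frame-level speech features for one word is summarized into a fixed-size word embedding:

```python
# Illustrative sketch (not the authors' code) of a Speech2Vec-style encoder
# used as a bottleneck: frame-level speech features for a single word are
# summarized into a fixed-size word-level representation.
# All names and dimensions here are assumptions for illustration.
import torch
import torch.nn as nn

class Speech2VecEncoder(nn.Module):
    def __init__(self, feat_dim=13, embed_dim=50):
        super().__init__()
        # Bidirectional RNN reads the feature frames of one word segment.
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, frames):
        # frames: (batch, n_frames, feat_dim) speech features for one word
        _, h = self.rnn(frames)
        # Concatenate final forward/backward states into a word embedding.
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * embed_dim)

# During pretraining, a decoder would be trained to reconstruct the speech
# features of neighbouring words from this embedding, mirroring Word2Vec's
# skip-gram objective; afterwards only the encoder is kept and its outputs
# feed the utterance classifier.
encoder = Speech2VecEncoder()
word_frames = torch.randn(1, 37, 13)  # e.g., 37 MFCC frames for one word
print(encoder(word_frames).shape)     # torch.Size([1, 100])
```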

Highlights

  • Speech interfaces have seen rapidly growing adoption, bringing increasing interest in advancing computational approaches to spoken language understanding (SLU) (Tur and De Mori, 2011; Xu and Sarikaya, 2014; Yao et al., 2013; Ravuri and Stolcke, 2015).

  • We focus on data from Motivational Interviewing (MI) sessions, a type of talk-based psychotherapy focused on behavior change.

  • For the purpose of our experiments, we obtain the word segmentation information using a forced aligner (Ochshorn and Hawkins, 2016); a sketch of how such alignments can be used appears after this list.
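
The following hedged sketch shows one way word start/end times from a forced aligner could be used to slice frame-level features into per-word segments. The alignment schema, the 10 ms frame shift, and the helper name `word_segments` are hypothetical illustrations, not the exact output format of the aligner used in the paper:

```python
# Illustrative use of forced-alignment word timings: start/end times (in
# seconds) select the span of frame-level features belonging to each word.
# The alignment schema below is an assumed example, not the aligner's
# actual output format.
import numpy as np

FRAME_SHIFT = 0.010  # assumed 10 ms hop between feature frames

alignment = [                      # hypothetical aligner output
    {"word": "hello", "start": 0.12, "end": 0.48},
    {"word": "there", "start": 0.55, "end": 0.90},
]

def word_segments(features, alignment, frame_shift=FRAME_SHIFT):
    """Yield (word, per-word feature frames) pairs from utterance features."""
    for w in alignment:
        lo = int(round(w["start"] / frame_shift))
        hi = int(round(w["end"] / frame_shift))
        yield w["word"], features[lo:hi]

# Usage with any (n_frames, feat_dim) feature array, e.g. MFCCs:
feats = np.random.randn(100, 13)   # 1 s of features at a 10 ms hop
for word, frames in word_segments(feats, alignment):
    print(word, frames.shape)
```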


Summary

Introduction

Speech interfaces have seen rapidly growing adoption, bringing increasing interest in advancing computational approaches to spoken language understanding (SLU) (Tur and De Mori, 2011; Xu and Sarikaya, 2014; Yao et al., 2013; Ravuri and Stolcke, 2015). Most previous works follow a transcription-based pipeline: they start with a transcript (manually generated or produced by ASR), which is first segmented into utterances; word embeddings for each word in the transcript are then fed into a classifier to predict target behavior codes. Our proposed approach instead exploits a speech-signal-to-word encoder, learned with an architecture similar to Speech2Vec, as the source of lower-level dynamic word representations for the utterance classifier. Because it does not rely on transcripts, our method should enable cheaper and faster behavioral annotation. We believe this framework is a promising direction for performing classification tasks directly on spoken utterances.
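
To make the contrast with the transcription-based pipeline concrete, here is a hedged sketch of the classifier stage (assuming PyTorch; the GRU-over-words design, hidden size, and number of behavior codes are illustrative assumptions, not the paper's exact configuration). Word vectors produced by the pretrained speech-to-word encoder are consumed directly, with no transcript involved:

```python
# Hedged sketch of an utterance-level classifier over word vectors coming
# from the pretrained speech-to-word encoder. Architecture details here
# (GRU, hidden size 128, 8 behavior codes) are hypothetical.
import torch
import torch.nn as nn

class UtteranceClassifier(nn.Module):
    def __init__(self, word_dim=100, hidden=128, n_codes=8):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_codes)

    def forward(self, word_vectors):
        # word_vectors: (batch, n_words, word_dim), one vector per word
        # from the speech-to-word encoder (frozen or fine-tuned; both are
        # plausible choices).
        _, h = self.rnn(word_vectors)
        return self.out(h[-1])  # (batch, n_codes) behavior-code logits

clf = UtteranceClassifier()
utterance = torch.randn(1, 12, 100)  # 12 word embeddings for one utterance
print(clf(utterance).shape)          # torch.Size([1, 8])
```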
