Abstract

Spoken language understanding tasks usually rely on pipelines involving complex processing blocks such as voice activity detection, speaker diarization and automatic speech recognition (ASR). We propose a novel framework for predicting utterance-level labels directly from speech features, removing the dependency on first generating transcripts and enabling transcription-free behavioral coding. Our classifier uses a pretrained Speech-2-Vector encoder as a bottleneck to generate word-level representations from speech features. This pretrained encoder learns to encode the speech features for a word using an objective similar to Word2Vec. Our proposed approach uses only speech features and word segmentation information to predict utterance-level target labels. We show that our model achieves results competitive with state-of-the-art approaches that use transcribed text for the task of predicting psychotherapy-relevant behavior codes.
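
As a minimal sketch of this bottleneck idea (assuming a PyTorch implementation; the module name `Speech2VecEncoder`, the feature dimensions, and the GRU choice below are illustrative assumptions, not the paper's code), a variable-length sequence of frame-level speech features for one word is summarized into a fixed-size word embedding:

```python
# Illustrative sketch (not the authors' code) of a Speech2Vec-style encoder
# used as a bottleneck: frame-level speech features for a single word are
# summarized into a fixed-size word-level representation.
# All names and dimensions here are assumptions for illustration.
import torch
import torch.nn as nn

class Speech2VecEncoder(nn.Module):
    def __init__(self, feat_dim=13, embed_dim=50):
        super().__init__()
        # Bidirectional RNN reads the feature frames of one word segment.
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, frames):
        # frames: (batch, n_frames, feat_dim) speech features for one word
        _, h = self.rnn(frames)
        # Concatenate final forward/backward states into a word embedding.
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * embed_dim)

# During pretraining, a decoder would be trained to reconstruct the speech
# features of neighbouring words from this embedding, mirroring Word2Vec's
# skip-gram objective; afterwards only the encoder is kept and its outputs
# feed the utterance classifier.
encoder = Speech2VecEncoder()
word_frames = torch.randn(1, 37, 13)  # e.g., 37 MFCC frames for one word
print(encoder(word_frames).shape)     # torch.Size([1, 100])
```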

Highlights

  • Speech interfaces have seen rapidly growing adoption, bringing increasing interest in advancing computational approaches to spoken language understanding (SLU) (Tur and De Mori, 2011; Xu and Sarikaya, 2014; Yao et al., 2013; Ravuri and Stolcke, 2015).

  • We focus on data from Motivational Interviewing (MI) sessions, a type of talk-based psychotherapy focused on behavior change.

  • For the purpose of our experiments, we obtain the word segmentation information using a forced aligner (Ochshorn and Hawkins, 2016); a sketch of how such alignments can be used appears after this list.
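
The following hedged sketch shows one way word start/end times from a forced aligner could be used to slice frame-level features into per-word segments. The alignment schema, the 10 ms frame shift, and the helper name `word_segments` are hypothetical illustrations, not the exact output format of the aligner used in the paper:

```python
# Illustrative use of forced-alignment word timings: start/end times (in
# seconds) select the span of frame-level features belonging to each word.
# The alignment schema below is an assumed example, not the aligner's
# actual output format.
import numpy as np

FRAME_SHIFT = 0.010  # assumed 10 ms hop between feature frames

alignment = [                      # hypothetical aligner output
    {"word": "hello", "start": 0.12, "end": 0.48},
    {"word": "there", "start": 0.55, "end": 0.90},
]

def word_segments(features, alignment, frame_shift=FRAME_SHIFT):
    """Yield (word, per-word feature frames) pairs from utterance features."""
    for w in alignment:
        lo = int(round(w["start"] / frame_shift))
        hi = int(round(w["end"] / frame_shift))
        yield w["word"], features[lo:hi]

# Usage with any (n_frames, feat_dim) feature array, e.g. MFCCs:
feats = np.random.randn(100, 13)   # 1 s of features at a 10 ms hop
for word, frames in word_segments(feats, alignment):
    print(word, frames.shape)
```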


Summary

Introduction

Speech interfaces have seen rapidly growing adoption, bringing increasing interest in advancing computational approaches to spoken language understanding (SLU) (Tur and De Mori, 2011; Xu and Sarikaya, 2014; Yao et al., 2013; Ravuri and Stolcke, 2015). Most previous works follow a transcription-based pipeline: they start with a transcript (manually generated or produced by ASR), which is first segmented into utterances; word embeddings for each word in the transcript are then fed into a classifier to predict target behavior codes. Our proposed approach instead exploits a speech-signal-to-word encoder, learned with an architecture similar to Speech2Vec, as the source of lower-level dynamic word representations for the utterance classifier. Because it does not rely on transcripts, our method should enable cheaper and faster behavioral annotation. We believe this framework is a promising direction for performing classification tasks directly on spoken utterances.
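
To make the contrast with the transcription-based pipeline concrete, here is a hedged sketch of the classifier stage (assuming PyTorch; the GRU-over-words design, hidden size, and number of behavior codes are illustrative assumptions, not the paper's exact configuration). Word vectors produced by the pretrained speech-to-word encoder are consumed directly, with no transcript involved:

```python
# Hedged sketch of an utterance-level classifier over word vectors coming
# from the pretrained speech-to-word encoder. Architecture details here
# (GRU, hidden size 128, 8 behavior codes) are hypothetical.
import torch
import torch.nn as nn

class UtteranceClassifier(nn.Module):
    def __init__(self, word_dim=100, hidden=128, n_codes=8):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_codes)

    def forward(self, word_vectors):
        # word_vectors: (batch, n_words, word_dim), one vector per word
        # from the speech-to-word encoder (frozen or fine-tuned; both are
        # plausible choices).
        _, h = self.rnn(word_vectors)
        return self.out(h[-1])  # (batch, n_codes) behavior-code logits

clf = UtteranceClassifier()
utterance = torch.randn(1, 12, 100)  # 12 word embeddings for one utterance
print(clf(utterance).shape)          # torch.Size([1, 8])
```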
