Abstract

Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.¹

¹ Our code and models will be publicly available as part of the ESPnet-SLU toolkit.
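As a rough illustration of the representation-extraction step described above, the sketch below loads a self-supervised speech encoder and a pre-trained LM from public checkpoints and produces frame-level speech features and token-level text features that a downstream SLU head (e.g., for NER or sentiment analysis) could consume. The specific checkpoints (wav2vec 2.0, BERT) and the HuggingFace Transformers API are assumptions for illustration; they are not the paper's exact recipe, whose experiments are built on ESPnet-SLU.

```python
# Illustrative sketch, not the paper's implementation: extract speech and text
# representations from publicly available self-supervised models.
import torch
from transformers import AutoModel, AutoTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Self-supervised speech encoder (wav2vec 2.0, assumed checkpoint) for frame-level features.
speech_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Pre-trained LM (BERT, assumed checkpoint) for token-level features of a transcript.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

def extract_features(waveform_16khz: torch.Tensor, transcript: str):
    """Return (speech_reps, text_reps) for a downstream SLU classifier or tagger."""
    with torch.no_grad():
        speech_inputs = speech_extractor(
            waveform_16khz.numpy(), sampling_rate=16000, return_tensors="pt"
        )
        speech_reps = speech_encoder(**speech_inputs).last_hidden_state  # (1, T', D_speech)

        text_inputs = tokenizer(transcript, return_tensors="pt")
        text_reps = text_encoder(**text_inputs).last_hidden_state  # (1, L, D_text)
    return speech_reps, text_reps
```

In this setup, the pre-trained encoders can either be frozen as fixed feature extractors or fine-tuned together with the SLU head, depending on the amount of labeled data available.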
