A Two-Stage Selective Fusion Framework for Joint Intent Detection and Slot Filling.

Ziyu Ma, Shutao Li, Bin Sun

doi:10.1109/tnnls.2022.3202562

Abstract

Spoken language understanding (SLU) is the core of speech-centric human-robot interaction systems and mainly involves intent detection and slot filling. Recent SLU research focuses on joint modeling of the two tasks because they are correlated. Slot information, however, consists of both slot position and slot type. Although slot types are semantically related to the intent, the slot positions for the same intent may vary widely across utterances because of the diversity of spoken language. Thus, conventional one-stage slot filling may introduce unrelated information into slot position prediction during the slot-intent interaction of joint modeling. We therefore propose a novel two-stage selective fusion framework for joint intent detection and slot filling. Unlike previous one-stage frameworks, the proposed framework decomposes slot filling into two stages: slot proposal and slot classification. The slot proposal network, consisting of BERT and bidirectional long short-term memory (Bi-LSTM)-conditional random field (CRF), predicts the slot positions. Instead of the tokenwise fusion used in existing methods, slot-intent feature fusion is performed only in slot classification. A selective fusion mechanism is designed to facilitate slot-intent interaction within each slot candidate for more accurate slot-type classification. Experiments on five standard benchmarks (ATIS, SNIPS, MixATIS, MixSNIPS, and DSTC4) show that the proposed framework achieves the best performance in comparison with several state-of-the-art methods.
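The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the selective fusion mechanism, so the parameter-free gate below is an assumed stand-in. All function names, the type-agnostic B/I/O tag set, the mean pooling, and the prototype-based classifier are hypothetical choices made for illustration. In the paper, stage 1 is a learned BERT and Bi-LSTM-CRF network; here its output tags are simply taken as given.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def propose_slots(tags: List[str]) -> List[Span]:
    """Stage 1 (sketch): turn type-agnostic B/I/O position tags into spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close the span that was open
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an I with no preceding B
                start = i
        else:                          # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # span running to the end of the utterance
        spans.append((start, len(tags)))
    return spans


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average token features inside one slot candidate."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gated_fusion(slot_vec: Vector, intent_vec: Vector) -> Vector:
    """ASSUMED stand-in for selective fusion: an elementwise sigmoid gate
    decides how much slot versus intent information to keep."""
    fused = []
    for s, u in zip(slot_vec, intent_vec):
        gate = 1.0 / (1.0 + math.exp(-(s * u)))
        fused.append(gate * s + (1.0 - gate) * u)
    return fused


def classify_slots(
    token_feats: List[Vector],
    spans: List[Span],
    intent_vec: Vector,
    type_prototypes: Dict[str, Vector],
) -> List[Tuple[Span, str]]:
    """Stage 2 (sketch): fuse intent features only within each slot candidate,
    then pick the slot type whose prototype has the highest dot product."""
    results = []
    for start, end in spans:
        fused = gated_fusion(mean_pool(token_feats[start:end]), intent_vec)
        best = max(
            type_prototypes,
            key=lambda name: sum(f * p for f, p in zip(fused, type_prototypes[name])),
        )
        results.append(((start, end), best))
    return results
```

For the utterance "play jazz by miles davis" with position tags `["O", "B", "O", "B", "I"]`, `propose_slots` yields the candidates `(1, 2)` and `(3, 5)`. The intent vector then enters only through `classify_slots`, so position prediction is untouched by intent features, which is the separation the abstract argues for.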

Full Text