Abstract

Predicting the roles of participants in conversations is a fundamental task in building a system that provides assessment results and feedback for each participant. Various role recognition models have been proposed. However, most studies have utilized only verbal or only nonverbal features, even though people usually express what they think or feel through a combination of language, gestures, and tone of voice. In this paper, we aim to realize a high-performance role recognition model by combining features from multiple modalities. We design nonverbal features that can be extracted from video and audio data. We then construct a multimodal leader identification method that fuses the nonverbal features we propose with verbal features proposed in a previous study. In our experiments, our multimodal model outperforms the baseline model that uses only verbal features. We also conduct analyses, including statistical tests and ablation studies, to verify the effectiveness of each modality and feature. Finally, we build a prototype feedback system and demonstrate how our study can be applied to discussion assessment and feedback systems.
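
To illustrate the kind of multimodal fusion described above, the following is a minimal sketch of feature-level (early) fusion for binary leader identification on synthetic data. The feature dimensions, feature names, and classifier are illustrative assumptions, not the actual feature set or model used in the paper.

```python
# Minimal sketch of early (feature-level) fusion for leader identification.
# All data here is synthetic; dimensions and feature semantics are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_participants = 400                  # hypothetical number of labeled participants
verbal_dim, nonverbal_dim = 32, 16    # hypothetical feature dimensions

# Per-participant feature vectors: verbal (e.g., lexical/dialogue features) and
# nonverbal (e.g., speaking time, gaze, prosody) -- placeholders for real features.
verbal = rng.normal(size=(n_participants, verbal_dim))
nonverbal = rng.normal(size=(n_participants, nonverbal_dim))
is_leader = rng.integers(0, 2, size=n_participants)  # 1 = leader, 0 = non-leader

# Early fusion: concatenate modality-specific vectors into one feature vector.
fused = np.concatenate([verbal, nonverbal], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    fused, is_leader, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1 on held-out participants:", f1_score(y_test, clf.predict(X_test)))
```

A verbal-only baseline can be obtained by training the same classifier on the verbal columns alone, which mirrors the comparison against the verbal-only baseline mentioned in the abstract.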
