Abstract

We propose a method for next-speaker prediction, the task of predicting who will speak in the next turn among the current listeners, in multi-party video conversation. Previous studies used non-verbal features, such as head movements and gaze behavior, for next-speaker prediction in face-to-face conversation. In video conversation, however, these non-verbal features become ambiguous and less effective because participants look at a screen displaying the other participants, so cues such as gaze direction lose much of their discriminative power. Since non-verbal features reflect participant-specific characteristics, training data covering a rich variety of participant combinations is necessary for robust next-speaker prediction. Previous studies relied on training data containing only a limited number of participant combinations, because such data consist solely of recorded conversations. Therefore, the proposed method uses 1) novel non-verbal features suited to next-speaker prediction in video conversation, specifically facial expressions, hand movements, and speech segments, and 2) data augmentation that increases the number of participant combinations in the training data. We conducted experiments to evaluate the proposed method, and the results on video-conversation data indicate its effectiveness.
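
As a concrete illustration of the augmentation idea, the sketch below shows one plausible way to enlarge the set of participant combinations: permuting the participant slots of a recorded conversation and relabeling the next speaker accordingly. This is a hypothetical reading, not the paper's exact scheme; the field names `features` and `next_speaker` and the function `augment_participant_combinations` are invented for this example.

```python
import itertools
import random

def augment_participant_combinations(sample, n_augmented=5, seed=0):
    """Hypothetical augmentation sketch (not the paper's exact method).

    sample: dict with
      'features':     list of per-participant non-verbal feature sequences
                      (e.g., facial expressions, hand movements, speech
                      segments), one entry per participant
      'next_speaker': index of the participant who speaks in the next turn

    Returns new training samples in which the participant slots are
    permuted, so a model cannot tie its prediction to a fixed slot or to
    one fixed combination of participants.
    """
    rng = random.Random(seed)
    n_participants = len(sample["features"])
    perms = list(itertools.permutations(range(n_participants)))
    augmented = []
    for perm in rng.sample(perms, min(n_augmented, len(perms))):
        augmented.append({
            # slot j of the new sample holds the features of old participant perm[j]
            "features": [sample["features"][i] for i in perm],
            # relabel the next speaker to the slot it moved to
            "next_speaker": perm.index(sample["next_speaker"]),
        })
    return augmented
```

In this reading, each recorded conversation yields several slot-permuted variants, which multiplies the effective number of participant combinations without collecting new recordings; mixing participants across different sessions would be another option under the same idea.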
