Abstract

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations and is thus consistent with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
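To make the learning mechanism described above concrete, the following is a minimal sketch of how audio-derived voice-activity labels could supervise a visual classifier. This is not the authors' implementation: the function names (energy_vad, make_visual_training_set) and the energy-threshold detector are illustrative assumptions standing in for a real voice activity detector.

```python
import numpy as np

def energy_vad(audio_frame, threshold=1e-3):
    """Crude energy-based stand-in for a real voice activity detector (VAD)."""
    return int(np.mean(audio_frame.astype(np.float64) ** 2) > threshold)

def make_visual_training_set(audio_frames, face_crops, vad=energy_vad):
    """Label time-aligned face crops with voice activity derived from audio.

    Assumes audio_frames[i] and face_crops[i] cover the same time window,
    so the auditory modality provides labels for the visual domain without
    any external annotation.
    """
    labels = np.array([vad(a) for a in audio_frames], dtype=np.int64)
    return np.stack(face_crops), labels

# Toy usage: 100 synchronized audio frames and 32x32 grayscale face crops.
rng = np.random.default_rng(0)
audio_frames = [rng.normal(0, 0.1, 400) for _ in range(100)]
face_crops = [rng.random((32, 32)) for _ in range(100)]
X, y = make_visual_training_set(audio_frames, face_crops)
# X and y can now train any image classifier (e.g., a small CNN) that
# predicts speaking/not-speaking from the face alone.
```

At test time such a classifier needs only video, which is why it can complement the acoustic detector when the audio channel is noisy.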

Highlights

  • The ability to acquire and use language as humans do may provide artificial cognitive systems with a unique communication capability and the means for referring to objects, events, and relationships

  • We propose to take advantage of the temporal synchronization of the visual and auditory modalities in order to improve the robustness of audio-based active speaker detection

  • The video-only method outperforms the audio-only voice activity detector (VAD) for more noisy conditions, whereas the opposite is true if the signal-to-noise ratio (SNR) is greater than 20 dB, as illustrated below
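For context on the SNR comparison above, the sketch below shows one standard way to create noisy test conditions at a prescribed SNR. The function name mix_at_snr is an assumption for illustration, not taken from the paper.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech/noise power ratio equals `snr_db`.

    Uses SNR(dB) = 10 * log10(P_speech / P_noise) and solves for the
    gain applied to the noise. Both signals are assumed equally long.
    """
    p_speech = np.mean(speech.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: degrade a clean 1-second, 16 kHz tone to 20 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noisy = mix_at_snr(clean, rng.normal(size=16000), snr_db=20)
```

Sweeping snr_db downward reproduces the kind of increasingly noisy conditions under which an audio-only VAD degrades while a video-only detector is unaffected.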



Introduction

The ability to acquire and use language as humans do may provide artificial cognitive systems with a unique communication capability and the means for referring to objects, events, and relationships. An artificial cognitive system with this capability will be able to engage in natural and effective interactions with humans. Developing such systems can help us […]. As mentioned in [1], modeling language acquisition is very complex and should integrate different aspects of signal processing, statistical learning, visual processing, pattern discovery, and memory access and organization.
