Abstract

For speaker tracking, integrating multimodal information from audio and video provides an effective and promising solution. A central challenge is the construction of a stable observation model. To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning, built on a two-layer particle filter framework. Firstly, an audio-guided motion model generates candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model is built on a designed Siamese network, which provides a similarity-based likelihood for computing particle weights. The speaker position is estimated from an optimal particle set that integrates the decisions of the audio and visual particles. Finally, a template update strategy based on a long short-term mechanism is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms single-modal trackers and competing methods, achieving efficient and robust tracking both in 3D space and on the image plane.
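
The paper's exact network and likelihood are not reproduced here; the following is a minimal sketch of the similarity-based weighting idea, assuming cosine similarity between embeddings produced by the two Siamese branches, with `sigma` as a hypothetical temperature parameter:

```python
import numpy as np

def siamese_similarity(template_emb, candidate_embs):
    """Cosine similarity between the stored speaker-template embedding
    and the embedding of each candidate patch (one per particle)."""
    t = template_emb / np.linalg.norm(template_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return c @ t  # shape: (num_particles,)

def particle_weights(similarities, sigma=0.1):
    """Turn similarities into a likelihood, peaked at similarity 1,
    and normalise so the weights sum to one."""
    likelihood = np.exp((similarities - 1.0) / sigma)
    return likelihood / likelihood.sum()

# Usage with random stand-in embeddings (a real system would take them
# from the two branches of the trained Siamese network):
rng = np.random.default_rng(0)
weights = particle_weights(siamese_similarity(rng.normal(size=128),
                                              rng.normal(size=(50, 128))))
```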

Highlights

  • Audio-visual speaker tracking is a key technology of human-machine interaction, driven by applications such as intelligent surveillance, smart spaces, and multimedia systems

  • As representative state-space approaches based on the Bayesian framework, the Kalman filter (KF) [2], the extended KF (EKF), and the particle filter (PF) [3] are commonly used methods

  • Unlike the above Bayesian methods, the probability hypothesis density (PHD) filter estimates the speaker number during the tracking process and is considered promising for multi-speaker tracking

Summary

Introduction

Audio-visual speaker tracking is a key technology of human-machine interaction, driven by applications such as intelligent surveillance, smart spaces, and multimedia systems. By analyzing the audio-visual data captured by multimodal sensor arrays, the positions of the speakers in the scene are continuously tracked, providing the underlying basis for subsequent action recognition and interaction. Current methods for speaker tracking are built on probabilistic generative models due to their ability to process multimodal information. The particle filter (PF) can recursively approximate the filtering distribution of the tracking targets by using dynamic models and random sampling. Unlike such Bayesian methods, the probability hypothesis density (PHD) filter estimates the speaker number during the tracking process and is therefore considered promising for multi-speaker tracking. However, the PHD filter restricts the propagation of the multi-target posterior distribution to its first-order moment, discarding higher-order cardinality information, which leads to speaker-number estimation errors in low signal-to-noise-ratio situations [5]. PF is selected as the tracking framework in this paper since it approaches the Bayesian optimal estimate without being constrained by linearity or Gaussian assumptions [6].
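
To make the PF recursion concrete, here is a minimal single-layer bootstrap PF step in Python; the `transition` and `likelihood` callables, the resampling threshold, and the posterior-mean estimate are generic placeholders, not the paper's two-layer audio-visual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, transition, likelihood, observation):
    """One predict-weight-resample cycle of a bootstrap particle filter.

    particles  : (N, d) array of state hypotheses (e.g. 3D positions)
    transition : motion model, propagates particles with process noise
    likelihood : scores each particle against the current observation
    """
    particles = transition(particles)              # predict (may be nonlinear)
    weights = weights * likelihood(particles, observation)
    weights /= weights.sum()                       # no Gaussian assumption needed
    estimate = weights @ particles                 # posterior-mean state estimate
    # Resample when the effective sample size collapses, to limit degeneracy.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights, estimate
```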
