Abstract

Speaker diarization systems aim to answer 'who spoke when?' in multi-speaker recordings. Typical datasets consist of meetings, TV and talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the accompanying audio. For this purpose, short video segments containing face-only regions are extracted using a face detection technique and fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. The audio frames from video segments on which the model is highly confident are then used to train Gaussian mixture model (GMM)-based clusters, yielding speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. The proposed method shows a significant improvement in diarization error rate (DER) over conventional and fully supervised audio-based speaker diarization. Its results are very close to those of complex state-of-the-art multimodal diarization systems, which demonstrates the significance of such a simple yet effective technique.
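
To make the pipeline concrete, the following is a minimal sketch of the GMM clustering stage under stated assumptions; it is not the authors' implementation. The synchronization confidences are assumed to come from the pre-trained two-stream model, MFCCs are an assumed audio front end, and `train_speaker_gmms` / `label_frames` are hypothetical helper names.

```python
# Minimal sketch of the GMM-based clustering stage (illustrative only).
# Assumptions: sync confidences come from the pre-trained two-stream
# audio-visual model; MFCCs are the assumed audio front end.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(audio, sr, n_mfcc=13):
    # Frame-level MFCC features, shape (n_frames, n_mfcc).
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T

def train_speaker_gmms(segments, sr, threshold=0.8, n_components=4):
    """Fit one GMM per visible speaker from high-confidence segments.

    `segments`: iterable of (speaker_id, sync_confidence, audio) tuples,
    where sync_confidence is the score the audio-visual model assigned
    to the face/audio pair of a short, face-only video segment.
    """
    per_speaker = {}
    for speaker_id, confidence, audio in segments:
        if confidence >= threshold:  # keep only well-synchronized segments
            per_speaker.setdefault(speaker_id, []).append(mfcc_frames(audio, sr))
    gmms = {}
    for speaker_id, feats in per_speaker.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(np.vstack(feats))
        gmms[speaker_id] = gmm
    return gmms

def label_frames(audio, sr, gmms):
    # Assign each audio frame to the speaker whose GMM scores it highest.
    feats = mfcc_frames(audio, sr)
    log_lik = np.stack([g.score_samples(feats) for g in gmms.values()])
    speakers = list(gmms)
    return [speakers[i] for i in log_lik.argmax(axis=0)]

# Toy usage with white noise standing in for real audio.
if __name__ == "__main__":
    sr = 16000
    segs = [("A", 0.90, np.random.randn(sr)), ("B", 0.95, np.random.randn(sr))]
    print(label_frames(np.random.randn(2 * sr), sr, train_speaker_gmms(segs, sr))[:10])
```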

Highlights

  • Speaker diarization systems assign speech segments to the people appearing in a dialogue

  • Our second comparison method is fully supervised speaker diarization (SD) as described in [23], which employs speech activity detection and speaker change detection [48] based on bidirectional LSTMs [49], and neural speaker embeddings [50] based on an LSTM network and a triplet loss function (a minimal triplet-loss sketch follows this list)

  • We developed our audio-visual speaker diarization system in Python using the Anaconda Python distribution on Windows 10, on a computer equipped with a 2.6 GHz CPU and 16 GB of RAM
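
As context for the triplet loss mentioned in the second highlight, here is a minimal sketch assuming PyTorch; it is not the baseline's code, and the function name is hypothetical. The loss pulls an anchor embedding toward a same-speaker embedding and pushes it away from a different-speaker embedding by at least a margin.

```python
# Minimal triplet-loss sketch for speaker embeddings (illustrative;
# not the fully supervised baseline's implementation). Assumes PyTorch.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor/positive share a speaker; negative is a different speaker.

    Pushes same-speaker embeddings closer than different-speaker ones
    by at least `margin` in Euclidean distance.
    """
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Example with random 64-dim "embeddings" for a batch of 4 segments.
a, p, n = (torch.randn(4, 64) for _ in range(3))
print(triplet_loss(a, p, n))
```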


Introduction

Speaker diarization systems assign speech segments to the people appearing in a dialogue. Diarization is usually an unsupervised task in which the number of speakers is unknown and only limited information about them is available. It is especially challenging when only unimodal data is available. Audio-based diarization must contend with overlapping speech from multiple speakers, environmental noise, short utterances, and reverberation. In video data, speakers may not face the camera, may move around during multi-party interactions, or may be occluded by other speakers, which limits the video modality largely to lip and face movement detection for diarization. The configuration of the recording equipment also varies considerably.

