Multimodal Speaker Diarization Research Articles

Speaker diarization system identifies the speaker homogenous regions in those set of recordings where multiple speakers are present. It answers the question `who spoke when?'. The data set for speaker diarization usually consists of telephone, meetings, TV/ talk shows, broadcast news and other multi-speaker recordings. In this paper, we present the performance of our proposed multimodal speaker diarization system under noisy conditions. Two types of noises comprising additive white Gaussian noise (AWGN) and realistic environmental noise is used to evaluate the system. To mitigate the effect of noise, we propose to add an LSTM based speech enhancement block in our diarization pipeline. This block is trained on synthesized data set with more than 100 noise types to enhance the noisy speech. The enhanced speech is further used in multimodal speaker diarization system which utilizes a pre-trained audio-visual synchronization model to find the active speaker. High confidence active speaker segments are then used to train the speaker specific clusters on the enhanced speech. A subset of AMI corpus consisting of 5.4 h of recordings is used in this analysis. For AWGN, the LSTM model performance improvement is comparable with Wiener filter while in case of realistic environmental noise, the LSTM model improves significantly as compared to Wiener filter in terms of diarization error rate (DER).

Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. The dataset usually consists of meetings, TV/talk shows, telephone and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through audio-visual synchronization model for diarization. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprised of face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two streamed network which matches audio frames with their respective visual input segments. On the basis of high confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker specific clusters with high probability. We tested our approach on a popular subset of AMI meeting corpus consisting of 5.4 h of recordings for audio and 5.8 h of different set of multimodal recordings. A significant improvement is noticed with the proposed method in term of DER when compared to conventional and fully supervised audio based speaker diarization. The results of the proposed technique are very close to the complex state-of-the art multimodal diarization which shows significance of such simple yet effective technique.

Multimodal Speaker Diarization Research Articles

Related Topics

Articles published on Multimodal Speaker Diarization

Speech Enhancement for Multimodal Speaker Diarization System

Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.

Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Soft Nonnegative Matrix Co-Factorization

Audio-visual speaker diarization using fisher linear semi-discriminant analysis

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Multimodal Speaker Diarization Research Articles

Related Topics

Articles published on Multimodal Speaker Diarization

Speech Enhancement for Multimodal Speaker Diarization System

Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.

Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Soft Nonnegative Matrix Co-Factorization

Audio-visual speaker diarization using fisher linear semi-discriminant analysis