Abstract

A speaker diarization system identifies speaker-homogeneous regions in recordings where multiple speakers are present. It answers the question `who spoke when?'. Data sets for speaker diarization usually consist of telephone calls, meetings, TV/talk shows, broadcast news, and other multi-speaker recordings. In this paper, we present the performance of our proposed multimodal speaker diarization system under noisy conditions. Two types of noise, additive white Gaussian noise (AWGN) and realistic environmental noise, are used to evaluate the system. To mitigate the effect of noise, we propose adding an LSTM-based speech enhancement block to our diarization pipeline. This block is trained on a synthesized data set covering more than 100 noise types. The enhanced speech is then passed to the multimodal speaker diarization system, which uses a pre-trained audio-visual synchronization model to find the active speaker. High-confidence active speaker segments are then used to train speaker-specific clusters on the enhanced speech. A subset of the AMI corpus consisting of 5.4 h of recordings is used in this analysis. For AWGN, the LSTM model's improvement is comparable to the Wiener filter, while for realistic environmental noise the LSTM model improves significantly over the Wiener filter in terms of diarization error rate (DER).
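The DER used throughout the paper is the standard diarization metric: the fraction of scored reference speech time that is missed, falsely detected, or attributed to the wrong speaker. A minimal frame-level sketch is below; it is an illustration, not the authors' scoring code (real scoring, e.g. NIST md-eval, also finds an optimal mapping between reference and hypothesis speaker labels, which this sketch assumes has already been done):

```python
def frame_der(reference, hypothesis):
    """Frame-level diarization error rate (illustrative sketch).

    Each argument is a list with one speaker label per fixed-length
    frame (e.g. 10 ms); None marks non-speech. Assumes hypothesis
    speaker labels are already mapped onto reference labels.
    """
    assert len(reference) == len(hypothesis)
    missed = false_alarm = confusion = scored = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:          # reference says someone is speaking
            scored += 1
            if hyp is None:
                missed += 1          # system detected no speech
            elif hyp != ref:
                confusion += 1       # speech detected, wrong speaker
        elif hyp is not None:
            false_alarm += 1         # system hallucinated speech
    # Denominator is total reference speech time; false alarms
    # appear only in the numerator, so DER can exceed 100%.
    return (missed + false_alarm + confusion) / scored


# Example: 4 frames, one miss and one false alarm over 3 speech frames
print(frame_der(['A', 'A', None, 'B'], ['A', None, 'B', 'B']))
```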

Highlights

  • Speaker diarization system identifies speaker homogeneous regions via either supervised or unsupervised approaches

  • Table 1 presents the results of the multimodal speaker diarization system on AWGN-corrupted speech and its enhancement via Wiener filtering and the LSTM model

  • On average, the LSTM model provided an 11.6% improvement in the system's diarization error rate, while Wiener filtering degraded performance by 1.23%

Summary

INTRODUCTION

Speaker diarization systems identify speaker-homogeneous regions via either supervised or unsupervised approaches. Audio-visual approaches have been proposed to aid the diarization process through robust segmentation and clustering of homogeneous speakers. These approaches usually try to identify the active speaker in the video. Available multimodal approaches typically apply lip and face movement detection and tracking [20]–[23] or audio-visual fusion [22], [24], [25] at the feature or output stage of diarization. Several multimodal diarization techniques have been proposed [21], [23], [25]–[29]; each varies depending on the recording scenarios in the dataset, such as overlapping speech, available audio-visual recordings, and speaker movement.

LITERATURE REVIEW
MATERIALS AND METHODS
LSTM BASED SPEECH ENHANCEMENT
VIDEO PIPELINE
AUDIO-VISUAL SYNCHRONIZATION MODEL
MULTIMODAL DIARIZATION PIPELINE
EVALUATION METRIC
COMPUTATIONAL COST
Findings
CONCLUSION
