Abstract

Although numerous works have studied automatic speaker identification (SID), only a few address SID for overlapping speech, and none of them consider more than two simultaneous speakers. Because overlapping speech occurs frequently in real-life scenarios such as meetings and debates, this work investigates methods for overlapping SID (OSID) that can determine the identities in overlapping speech from up to five simultaneous speakers. We propose two deep-learning OSID systems: one two-stage and one single-stage. The two-stage system first determines the number of simultaneous speakers and then identifies the speaker(s). The single-stage system uses a single classifier to perform OSID directly, which makes it slightly more computationally efficient than the two-stage system. Our experiments show that the two-stage OSID system achieves better identification accuracy than the single-stage system. In addition, both OSID systems based on one-dimensional convolutional neural networks (1DCNN) outperform the systems based on multilayer perceptrons (MLP) and Gaussian mixture models (GMMs). The proposed 1DCNN-based two-stage OSID system achieves 98.55% OSID accuracy on clean audio data containing up to five simultaneous speakers. In more challenging experimental conditions involving both background noise and high overlapping energy ratios, the system still attains accuracies above 90%.
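The difference between the two decision strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the score dictionaries, and the 0.5 threshold are hypothetical, and the single-stage variant is written as a per-speaker threshold, which is only one possible reading of "a single classifier". The scores stand in for the outputs of trained models such as the 1DCNN.

```python
from typing import Dict, List


def two_stage_identify(count_scores: List[float],
                       speaker_scores: Dict[str, float]) -> List[str]:
    """Hypothetical two-stage decision: estimate the speaker count, then pick speakers."""
    # Stage 1: count_scores[i] is the score for "i + 1 simultaneous speakers".
    num_speakers = max(range(len(count_scores)), key=count_scores.__getitem__) + 1
    # Stage 2: keep the num_speakers highest-scoring identities.
    ranked = sorted(speaker_scores, key=speaker_scores.get, reverse=True)
    return sorted(ranked[:num_speakers])


def single_stage_identify(speaker_scores: Dict[str, float],
                          threshold: float = 0.5) -> List[str]:
    """Hypothetical single-stage decision: one set of scores, no separate count step."""
    # Assumption: a speaker is declared active when its score reaches the threshold.
    return sorted(name for name, score in speaker_scores.items() if score >= threshold)


# Made-up scores for illustration only.
count_scores = [0.1, 0.7, 0.2]  # most likely: two simultaneous speakers
speaker_scores = {"alice": 0.9, "bob": 0.6, "carol": 0.4, "dave": 0.1}

two_stage_result = two_stage_identify(count_scores, speaker_scores)
single_stage_result = single_stage_identify(speaker_scores)
```

In this sketch the two-stage path needs an extra count estimate before identification, which mirrors the abstract's remark that the single-stage system is slightly cheaper to run.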

Highlights

  • Overlapping speech, or simultaneous speech, in which multiple persons speak at the same time, occurs naturally in real-life scenarios such as daily conversations, telephone conversations, multiparty meetings, and debates

  • Results of the S-OSID and two-stage OSID (T-OSID) systems at different OERs: to investigate the effectiveness of the proposed deep learning (DL)-based approaches for overlapping SID (OSID), we evaluated the OSID systems on evaluation data containing various overlapping energy ratios (OERs)

  • We propose a deep learning-based approach to building the OSID systems, in addition to the widely used Gaussian mixture model (GMM)-based approach


Summary

Introduction

Overlapping speech, or simultaneous speech, in which multiple persons speak at the same time, occurs naturally in real-life scenarios such as daily conversations, telephone conversations, multiparty meetings, and debates. The study in [1] reported that about 8% to 17% of words in meeting conversations contained overlapping speech, while the percentage of overlapped words in phone conversations was 11% to 12%. Overlapping speech degrades the performance of many speech applications, such as speaker identification (SID), speaker diarization, and automatic speech recognition (ASR) [1]–[5]. It was reported in [2] that overlaps in meeting speech led to an additional word error rate (WER) of 11% and a further diarization error of 17%.

