Abstract
Overlapped speech, in which multiple speakers talk simultaneously, causes problems for speech recognition and speaker diarization systems. The present work applies signal phase information, which has been less utilized in earlier work, to the task of overlapped speech detection. Two phase-based features are explored: the Instantaneous Frequency Cosine Coefficient (IFCC) and the Modified Group Delay Cepstral Coefficient (MGDCC). IFCC captures the time-varying characteristics of the phase, while MGDCC represents the frequency-varying information of the phase spectrum. A classifier combining a Convolutional Neural Network with Long Short-Term Memory (CNN-LSTM) is used for classification. The overlapped speech is generated synthetically from the GRID corpus. The proposed method is benchmarked against three baseline approaches that use magnitude spectrum features. The combination of IFCC and MGDCC features with the CNN-LSTM classifier performs better than the baselines. Combining the phase features with the magnitude-based MFCC feature gives the best performance, indicating that the two kinds of features carry complementary information. The study also investigates the effect of segment duration, speaker gender, and the number of simultaneous speakers on the overlapped speech detection system. Finally, the proposed method is evaluated on real overlapped speech from the AMI corpus.
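To make the idea of frequency-varying phase information concrete, the following is a minimal sketch of the plain (unmodified) group delay function of a single frame, computed without phase unwrapping from the spectra of x[n] and n·x[n]. It is not the MGDCC feature itself: the modified variant additionally uses a smoothed spectrum and compression parameters before cepstral coefficients are taken, and none of that is shown here. The function name `group_delay` and the parameters `n_fft` and `eps` are assumptions made for illustration.

```python
import numpy as np

def group_delay(x, n_fft=512, eps=1e-8):
    """Plain group delay of one frame, in samples (illustrative sketch).

    Uses the identity tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, where
    X is the spectrum of x[n] and Y is the spectrum of n*x[n].
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)      # spectrum of the frame
    Y = np.fft.rfft(n * x, n_fft)  # spectrum of the time-weighted frame
    # eps guards against division by zero at spectral nulls
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

# Sanity check: an impulse delayed by 5 samples has a constant group delay of 5.
frame = np.zeros(64)
frame[5] = 1.0
tau = group_delay(frame)
```

For real speech frames the plain group delay is spiky near spectral nulls, which is the usual motivation for the modified version.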