Abstract

Speech technology systems such as Automatic Speech Recognition (ASR), speaker diarization, speaker recognition, and speech synthesis have advanced significantly with the emergence of deep learning techniques. However, these voice-enabled systems still degrade in natural environmental conditions, particularly when one or more interfering talkers are present. Overlapping speech detection has therefore become an important front-end triage step for speech technology applications, and it is especially crucial for large-scale datasets where manual labeling is not feasible. A block-based CNN architecture is proposed to model overlapping speech in audio streams using frames as short as 25 ms. The proposed architecture is robust to both (i) shifts in the distribution of network activations caused by changes in network parameters during training, and (ii) local variations in the input features caused by feature extraction, environmental noise, or room interference. We also investigate the effect of alternative input features, including spectral magnitude, MFCC, MFB, and pyknogram, on both computational time and classification performance. Evaluation is performed on simulated overlapping speech signals based on the GRID corpus. The experimental results highlight the capability of the proposed system in detecting overlapping speech frames, with 90.5% accuracy, 93.5% precision, 92.7% recall, and 92.8% F-score on same-gender overlapped speech. For opposite-gender cases, the network exceeds 95% on all classification metrics.
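
To make the frame-level classification setup more concrete, the sketch below illustrates the general idea under stated assumptions: a small stack of convolutional blocks with batch normalization (addressing shifts in activation distributions) and max pooling (tolerating local feature variation), applied to MFCC features extracted around each 25 ms frame. The feature dimensions, context size, layer widths, and class names are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of frame-level overlapping speech detection with a block-based CNN.
# Assumptions (not from the paper): 40 MFCC coefficients per 25 ms frame,
# an 11-frame context window, and illustrative layer sizes.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU -> MaxPool.
    Batch normalization counters shifts in activation distributions during training;
    max pooling gives some tolerance to local variations in the input features."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class OverlapDetector(nn.Module):
    """Binary classifier: overlapped vs. non-overlapped speech frame."""
    def __init__(self, n_mfcc: int = 40, context: int = 11):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBlock(1, 16),
            ConvBlock(16, 32),
        )
        # After two 2x2 poolings the map is (32, n_mfcc // 4, context // 4).
        flat = 32 * (n_mfcc // 4) * (context // 4)
        self.classifier = nn.Linear(flat, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mfcc, context) -- one context window per target frame
        h = self.blocks(x)
        return self.classifier(h.flatten(1))


if __name__ == "__main__":
    model = OverlapDetector()
    dummy = torch.randn(8, 1, 40, 11)  # batch of 8 context windows
    print(model(dummy).shape)          # torch.Size([8, 2]) -> overlap / no-overlap logits
```

In practice, the same pattern would be repeated with each candidate input feature (spectral magnitude, MFCC, MFB, or pyknogram) by swapping the front-end feature extractor while keeping the classifier fixed, which is how the feature comparison described above could be carried out.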
