Two common disciplines of speech processing are speaker recognition “identification and verification of speaker”, and speaker diarization, “who spoke when”. Motivated by various applications in automatic speaker recognition, speaker indexing, word counting, and audio transcription, speaker diarization (SD) becomes a significant area of signal processing. The basic designing steps of SD are feature extraction, voice activity detection (VAD), segmentation, and clustering. VAD process is accomplished by Daubechies 40, discrete wavelets transform (DWT). Initially, DWT was used for compression, scaling, and denoising of audio-stream and then partitioned into small frames of size 0.12 seconds. Next, features of each frame were extracted by applying nonlinear energy operator (NEO) based pyknogram. To measure the similarity between frames, a sliding window on delta-BIC distance metric was applied. A negative value of its output represents the same segments and vice-versa. To improve the output of the segmentation process, resegmentation was applied by information change rate method. At last, hierarchical clustering groups the homogeneous segments that correspond to a particular speaker and has been graphically represented by the dendrogram. The performance of SD was evaluated by F-measure and speaker diarization error rate (SER) and their results were compared with the traditional speaker diarization system that uses MFCC and BIC for segmentation and clustering. It reveals a significant reduction of 12.3% of SER in the proposed diarization system.
Read full abstract