Abstract

In speaker diarization, speech/voice activity detection (SAD) is performed to separate speech, non-speech and silent frames. The zero crossing rate and root mean square value of the frames of audio clips are used to select training data for the silence, speech and non-speech models. The trained models are used by two classifiers, a Gaussian mixture model (GMM) and an artificial neural network (ANN), to classify the speech and non-speech frames of an audio clip. The results of the ANN and GMM classifiers are compared using the receiver operating characteristic (ROC) curve and the detection error tradeoff (DET) graph. It is concluded that neural-network-based SAD performs comparatively better than Gaussian-mixture-model-based SAD.

Highlights

  • Speaker diarization is the process of determining “who spoke when” in an audio recording of a meeting

  • In a speaker diarization system, speech activity detection (SAD) divides the audio signal into speech and non-speech signals using the zero crossing rate (ZCR) and root mean square (RMS) value, as shown in figure 3

  • The data sources are video clips downloaded free of charge from youtube.com in MP4 format and then converted to .wav format, and noise-free recordings of TV shows made with a single distant microphone (SDM)
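
The two frame-level features named in the highlights can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the frame length and the hop size are assumptions, and the thresholds the paper uses to pick silence, speech and non-speech training frames are not given here, so none are shown.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a sequence of samples into (possibly overlapping) frames.

    frame_len and hop are in samples; both values are illustrative choices.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms(frame):
    """Root mean square value (a measure of frame energy)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


if __name__ == "__main__":
    # Toy signal: a rapidly alternating burst followed by silence.
    signal = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
    for frame in frame_signal(signal, frame_len=4, hop=4):
        print(zero_crossing_rate(frame), rms(frame))
```

A frame with low RMS is a candidate for silence, and ZCR helps to tell voiced speech apart from noisier content; how the two values are combined is left to the full text.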

Summary

INTRODUCTION

Speaker diarization is the process of determining “who spoke when” in an audio recording of a meeting. Speaker diarization requires tools for extracting features of the audio signal, speech activity detection (SAD), segmentation, clustering and re-alignment, as shown in figure 1. The system is composed of speech feature extraction, feature selection, SAD, speech segmentation, model clustering, re-alignment and evaluation modules. In a speaker diarization system, SAD divides the audio signal into speech and non-speech signals using the zero crossing rate (ZCR) and root mean square (RMS) value, as shown in figure 3. The trained GMM models and the test data are passed to the GMM classifier, which classifies the test data into speech and non-speech frames. Likewise, the trained neural network model and the test data are given to the artificial neural network (ANN) classifier, which classifies the test data into speech and non-speech frames. The outputs of the two classifiers, GMM and ANN, are evaluated using the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC) and the DET graph [11].
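
As an illustration of the evaluation step, the sketch below computes ROC points and the AUC from per-frame classifier scores. It is a generic textbook computation, not the paper's evaluation code: the function names and the toy scores are hypothetical, and labels are assumed to be 1 for speech and 0 for non-speech. A DET graph plots the same operating points as miss rate (1 - TPR) against false alarm rate, usually on normal-deviate axes.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a decision threshold.

    scores: higher means "more likely speech" (assumed convention).
    labels: 1 for speech frames, 0 for non-speech frames.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Frames with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def det_points(points):
    """Convert ROC points to (false alarm rate, miss rate) pairs."""
    return [(fpr, 1.0 - tpr) for fpr, tpr in points]


if __name__ == "__main__":
    # Hypothetical scores from a classifier on four frames.
    scores = [0.9, 0.4, 0.6, 0.1]
    labels = [1, 1, 0, 0]
    pts = roc_points(scores, labels)
    print(pts, auc(pts))
```

Running the same functions on the GMM scores and on the ANN scores gives two curves and two AUC values that can be compared directly; a larger AUC indicates better separation of speech from non-speech frames.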

EXPERIMENTS AND RESULTS
Evaluation criteria
CONCLUSION