Abstract

Separating pre-recorded messages (Interactive Voice Response, IVR) from live speech fragments in real time plays a significant role in speech emotion recognition (SER) systems, unwanted call filtering, automatic detection of answering machine responses, reduction of stored record sizes, voice mail spam filtering, and similar applications. The difficulty is that, unlike the silence, music, and noise fragments handled by conventional voice activity detection (VAD), IVR usually contains speech. Three classifiers for detecting live speech fragments in phone call recordings are considered: a support vector machine (SVM), gradient boosting (XGBoost), and a convolutional neural network (CNN). Audio fragments were represented by the Geneva Minimalistic Acoustic Parameter Set for XGBoost and SVM, and by log-spectrograms and gammatonegrams for the CNN. Experiments on a dataset of phone calls show comparable quality for the considered algorithms (around 0.96 by the F1-averaged measure), with the CNN having an advantage (0.98).
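As an illustration of the reported metric, the sketch below computes an averaged F1 score for a two-class problem (live speech vs. IVR). It assumes that "F1-averaged" means the unweighted (macro) mean of the per-class F1 scores, which the abstract does not state explicitly. The label names `live` and `ivr`, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # If the class never appears in truth or predictions, define F1 as 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed reading of 'F1-averaged')."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy example with hypothetical labels; not data from the paper.
y_true = ["live", "live", "ivr", "ivr", "ivr", "live"]
y_pred = ["live", "ivr", "ivr", "ivr", "live", "live"]
score = macro_f1(y_true, y_pred)
```

Under this reading, both classes contribute equally to the score regardless of how many fragments each contains, which matters when IVR and live speech fragments are imbalanced.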
