Voice Activity Detection: Fusion of Time and Frequency Domain Features with A SVM Classifier

doi:10.7176/ceis/13-3-03

Abstract

Voice activity detection (VAD) separates the segments of an audio signal that contain speech from those that contain only noise or silence. It is deployed as the front end of speech processing applications such as voice recognition and speaker recognition to improve their accuracy and efficiency. It is also used in communication systems to make efficient use of transmission bandwidth by ensuring that only segments of the audio signal with voice activity are encoded and transmitted. In this work, the VAD algorithm was implemented using a feature-fusion strategy. In the pre-processing stage, content outside the human auditory frequency range was removed with a digital Butterworth bandpass filter. The signal was then split into frames, from which time-domain features (zero-crossing, standard deviation, normalized envelope, kurtosis, skewness, and root-mean-square energy) and frequency-domain features (13 Mel-frequency cepstral coefficients, MFCCs) were extracted and combined to form a feature representation of each frame. Recursive feature elimination was applied to the dataset to reduce the features to seven (7), which were used to train a Support Vector Machine (SVM) to distinguish between voiced and unvoiced frames. This simple SVM-based VAD system recorded state-of-the-art performance, with an accuracy of 100%, recall of 100%, precision of 100%, and F1 score of 100%, which is on par with similar implementations that use complex deep neural network architectures with high computational cost and training time.

Keywords: Voice activity detection, fusion strategy, support vector machine, frequency domain features, time domain features

DOI: 10.7176/CEIS/13-3-03

Publication date: May 31st 2022
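As an illustration of the framing and time-domain feature extraction step described in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`frame_signal`, `time_domain_features`), the frame length, the hop size, and the synthetic test signal are all assumptions made for this example. The Butterworth filtering, normalized envelope, MFCC extraction, recursive feature elimination, and SVM training stages are deliberately left out.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into fixed-length frames (hypothetical helper)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def time_domain_features(frame):
    """Compute per-frame time-domain statistics (hypothetical helper)."""
    n = len(frame)
    mean = sum(frame) / n
    var = sum((x - mean) ** 2 for x in frame) / n  # population variance
    std = math.sqrt(var)
    rms = math.sqrt(sum(x * x for x in frame) / n)
    # Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
    zcr = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:])) / (n - 1)
    if std > 0:
        # Standardized third and fourth central moments.
        skew = sum((x - mean) ** 3 for x in frame) / (n * std ** 3)
        kurt = sum((x - mean) ** 4 for x in frame) / (n * std ** 4)
    else:
        # A constant frame (e.g. digital silence) has no defined shape statistics.
        skew, kurt = 0.0, 0.0
    return {"zcr": zcr, "std": std, "rms": rms,
            "skewness": skew, "kurtosis": kurt}

# Synthetic signal (assumed for illustration): 0.1 s of a 200 Hz tone
# followed by 0.1 s of silence, sampled at 8 kHz.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * t / sr) for t in range(800)]
silence = [0.0] * 800

# Non-overlapping 25 ms frames (200 samples each).
frames = frame_signal(tone + silence, frame_len=200, hop=200)
features = [time_domain_features(f) for f in frames]
```

In a full pipeline of the kind the abstract describes, each of these per-frame dictionaries would be concatenated with the frame's MFCCs before feature selection and classification.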

Full Text