Abstract
Speech activity detectors (SADs) are essential in noisy environments for achieving acceptable performance in speech applications such as speech recognition. This paper presents a two-stage speech activity detection system. The first stage uses a voice activity detector to discard pause segments from the audio signal, even in the presence of stationary background noise. The second stage classifies the remaining segments as speech or non-speech. To find the best feature set for speech/non-speech classification, a large set of robust features is introduced, and the optimal subset is selected by applying a genetic algorithm (GA) to the initial feature set. The fractal dimensions of numeric series of prosodic features are found to be the most discriminative features for separating speech from non-speech. The system's models are trained on a Farsi database, FARSDAT; test experiments are also conducted on the English TIMIT database. Using the SAD system in conjunction with an automatic speech recognition (ASR) system yields a relative word error rate (WER) reduction of up to 28.3%.
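To make the notion of a fractal dimension of a prosodic feature series concrete, here is a minimal sketch. The abstract does not say which fractal dimension estimator is used, so Katz's estimator is assumed here purely for illustration; the function name `katz_fd`, the unit spacing between samples, and the example contours are all hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def katz_fd(series: Sequence[float]) -> float:
    """Katz fractal dimension of a 1-D series (illustrative assumption,
    not necessarily the estimator used in the paper).

    The series is treated as a planar curve through the points (i, x_i),
    i.e. samples are assumed to be one unit apart on the time axis.
    """
    if len(series) < 3:
        raise ValueError("need at least three samples")
    steps = len(series) - 1
    # Total curve length: sum of distances between successive points.
    length = sum(
        math.hypot(1.0, series[i + 1] - series[i]) for i in range(steps)
    )
    # Planar extent: largest distance from the first point to any other point.
    extent = max(
        math.hypot(i, series[i] - series[0]) for i in range(1, len(series))
    )
    log_n = math.log10(steps)
    return log_n / (log_n + math.log10(extent / length))


# Hypothetical feature contours (e.g. a pitch or energy track per frame).
smooth_contour = [0.5 * i for i in range(50)]      # straight ramp
jagged_contour = [float(i % 2) for i in range(101)]  # rapid zigzag

fd_smooth = katz_fd(smooth_contour)
fd_jagged = katz_fd(jagged_contour)
```

A straight ramp gives a value of 1 up to floating-point rounding, and the more irregular zigzag gives a larger value, which is the kind of contrast a speech/non-speech classifier could exploit. The estimate depends on the amplitude scale of the series relative to the unit time step, so a real system would likely normalise each series first; that choice is likewise an assumption here.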
More From: Pattern Recognition Letters