Abstract

Precise detection of speech endpoints is an important factor which affects the performance of the systems where speech utterances need to be extracted from the speech signal such as Automatic Speech Recognition (ASR) system. Existing endpoint detection (EPD) methods mostly uses Short-Term Energy (STE), Zero-Crossing Rate (ZCR) based approaches and their variants. But STE and ZCR based EPD algorithms often fail in the presence of Non-speech Sound Artifacts (NSAs) produced by the speakers. Pattern recognition and classification techniques are also applied but those methods require labeled data for training. In this article, a novel approach is proposed to extract speech endpoints and the algorithm is termed as Wavelet Convolution based Speech Endpoint Detection (WCSED). WCSED decomposes the speech signal into high-frequency and low-frequency components using wavelet convolution and then computes information-entropy based thresholds for the two frequency components. The low-frequency thresholds are used to extract voiced speech segments, whereas the high-frequency thresholds are used to extract the unvoiced speech segments by filtering out the NSAs. WCSED does not require any labeled data for training and can automatically extract speech segments. Experiments are carried out on two speech databases and the results are promising even in the presence of NSAs.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call