Abstract

Speech is one of the most direct and convenient human-machine interfaces. In real-world scenarios, however, interference and noise can degrade the speech signal and thus reduce speech quality and intelligibility. Speech enhancement (SE) is therefore an essential component of speech-communication systems. Recently, numerous deep-learning-based SE approaches have been proposed and have yielded satisfactory performance. In a deep-learning-based SE system, defining a proper objective function plays a crucial role in its success. Generally, the mean square error (MSE) between the predicted and desired outputs is used as the objective function for learning the parameters of deep-learning models. Because a sequence of speech signals contains various patterns, such as consonants, vowels, beginning and ending silences, and short pauses, it is not optimal to simply use MSE as the objective function, since the contributions of these different patterns may be averaged out. Instead, specific weights should be applied to distinct patterns when designing the objective function. In this presentation, we propose a novel objective function for a deep denoising autoencoder-based SE system. The proposed objective function is derived from MSE by multiplying it by a ratio calculated from the clean and noisy speech. The enhanced speech is evaluated using standardized evaluation metrics, and the experimental results confirm that the proposed objective function improves the intelligibility of enhanced speech.
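
The idea of scaling MSE by a ratio computed from clean and noisy speech can be illustrated with a minimal sketch. The abstract does not give the exact form of the ratio, so the per-bin weight used here, |clean| / (|noisy| + eps), is purely an assumption chosen for illustration; the function names, the `eps` parameter, and the toy values are likewise hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mse(pred: Sequence[float], clean: Sequence[float]) -> float:
    """Plain mean square error: every bin contributes equally."""
    return sum((p - c) ** 2 for p, c in zip(pred, clean)) / len(clean)


def ratio_weighted_mse(
    pred: Sequence[float],
    clean: Sequence[float],
    noisy: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """MSE in which each squared error is scaled by a clean-to-noisy ratio.

    ASSUMPTION: the weight |clean| / (|noisy| + eps) is a placeholder.
    The source only states that the ratio is calculated from clean and
    noisy speech; its actual definition is not given there.
    """
    total = 0.0
    for p, c, n in zip(pred, clean, noisy):
        weight = abs(c) / (abs(n) + eps)  # hypothetical per-bin weight
        total += weight * (p - c) ** 2
    return total / len(clean)


# Toy magnitudes for three bins (illustrative values only).
clean = [0.0, 1.0, 4.0]   # first bin is silence in the clean signal
noisy = [2.0, 2.0, 4.0]   # noise dominates the first two bins
pred = [1.0, 2.0, 5.0]    # each prediction is off by exactly 1.0

plain_loss = mse(pred, clean)
weighted_loss = ratio_weighted_mse(pred, clean, noisy)
print(plain_loss, weighted_loss)
```

In this toy case every bin has the same squared error, so plain MSE treats them identically, whereas the weighted version lets bins where clean speech dominates the noisy input contribute more. Under this assumed ratio, errors in silent bins are ignored entirely, which may or may not match the behavior of the ratio used in the paper.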

