Abstract
Recently, deep neural networks (DNNs) were successfully introduced to the speech enhancement area. Conventional DNN-based algorithms generally produce over-smoothed output features which deteriorate the quality of the enhanced speech. In addition, their performance measures calculated in the linear frequency scale do not match the human auditory perception where the sensitivity follows the Mel-frequency scale. In this paper, we propose a novel objective function for DNN-based speech enhancement algorithm. In the proposed technique, a new objective function which consists of the Mel-scale weighted mean square error, and temporal and spectral variations similarities between the enhanced and clean speech is employed in the DNN training stage. The proposed objective function helps to compute the gradients based on a perceptually motivated non-linear frequency scale and alleviates the over-smoothness of the estimated speech. In the experiments, the performance of the proposed algorithm was compared to the conventional DNN-based speech enhancement algorithm in matched and mismatched noise conditions. From the experimental results, we can see that the proposed algorithm performs better than the conventional algorithm in terms of both the objective and subjective measures.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have