Abstract

Because traditional single-channel speech enhancement algorithms are sensitive to the environment and perform poorly, a speech enhancement algorithm based on attention-gated long short-term memory (LSTM) is proposed. To simulate human auditory perceptual characteristics, the algorithm divides the frequency band according to the Bark scale. Based on these bands, Bark frequency cepstral coefficients (BFCCs), their derivative features and pitch-based features are extracted. Furthermore, because different noises affect clean speech differently, an attention mechanism is applied to select the information that is less contaminated by noise, which helps to reconstruct the clean speech. To adaptively reallocate the power ratio of the speech and noise during the construction of the ratio mask, the ideal ratio mask (IRM) with the inter-channel correlation (ICC) is adopted as the learning target. In addition, to improve the performance of the network, the algorithm introduces a multiobjective learning strategy that jointly optimizes the networks by using a voice activity detector (VAD). Subjective and objective experiments show that the proposed algorithm outperforms the baseline algorithms. In real-time experiments, the proposed algorithm maintains high real-time performance and fast convergence speed.
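As a rough illustration of the ratio-mask learning target mentioned above, the following sketch computes the classic IRM per frequency band from known speech and noise band energies. It is not the paper's ICC-weighted variant, whose formula is not given in this summary; the function name `ideal_ratio_mask`, the exponent `beta` and the stabilizer `eps` are assumptions made for the example.

```python
def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5, eps=1e-12):
    """Classic per-band IRM: (S / (S + N)) ** beta.

    speech_energy, noise_energy: per-band energies of the clean speech
    and of the noise (same length). beta=0.5 is a common choice and is
    an assumption here, as is eps, which only guards against 0/0.
    """
    if len(speech_energy) != len(noise_energy):
        raise ValueError("speech and noise must have the same number of bands")
    return [
        (s / (s + n + eps)) ** beta
        for s, n in zip(speech_energy, noise_energy)
    ]


# Three bands: speech only, equal speech and noise, noise only.
mask = ideal_ratio_mask([4.0, 1.0, 0.0], [0.0, 1.0, 2.0])
print([round(m, 3) for m in mask])
```

A mask value near 1 keeps a band that is dominated by speech, and a value near 0 suppresses a band that is dominated by noise. During training, a network would be asked to predict such a mask from noisy features only.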

Highlights

  • Speech enhancement (SE) has important applications in various fields of speech processing, mainly to improve the quality and intelligibility of speech corrupted by noise

  • A fast Fourier transform (FFT) is carried out, the frequency band is divided according to the Bark scale, and signal features are extracted

  • Under unmatched noise, the STOI scores of the deep neural network (DNN) algorithm and of the generative adversarial network (GAN) algorithm are clearly lower than the STOI score of the noisy signal itself; these results show that the two algorithms generalize poorly
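The second highlight (transform the frame, then group the frequency bins by Bark band) can be sketched as follows. This is a minimal example under stated assumptions, not the paper's feature extractor: it uses the Zwicker-Terhardt approximation of the Bark scale, assigns each bin to a band by the integer part of its Bark value, and replaces the FFT with a naive DFT so that the example needs only the standard library. The function names are hypothetical.

```python
import cmath
import math


def hz_to_bark(freq_hz):
    """Zwicker-Terhardt approximation of the Bark scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def power_spectrum(frame):
    """Power of DFT bins 0..N/2 (naive DFT; a real system would use an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def bark_band_energies(frame, sample_rate):
    """Sum the power of each bin into the Bark band it falls in."""
    n = len(frame)
    num_bands = int(hz_to_bark(sample_rate / 2.0)) + 1
    bands = [0.0] * num_bands
    for k, power in enumerate(power_spectrum(frame)):
        freq_hz = k * sample_rate / n
        bands[int(hz_to_bark(freq_hz))] += power
    return bands


# A 1 kHz tone sampled at 16 kHz, 64 samples per frame.
sample_rate = 16000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
energies = bark_band_energies(frame, sample_rate)
print(len(energies), energies.index(max(energies)))
```

Features such as BFCCs would then be derived from the logarithm of these band energies; that step is omitted here.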


Summary

INTRODUCTION

R. Liang et al.: Real-Time SE Algorithm Based on Attention LSTM

Speech enhancement (SE) has important applications in various fields of speech processing, mainly to improve the quality and intelligibility of speech corrupted by noise. Mahmmod et al. propose an optimum low-distortion estimator with models that fit well with speech and noise signals to decrease the deviation of Gaussian or super-Gaussian models [7]. These algorithms are designed based on the complex statistical characteristics of the interaction between noise and clean speech, but they usually assume that the noise signal is relatively stable or changes slowly. Fu et al. propose an end-to-end SE framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and the evaluation criterion [27]. This algorithm can effectively improve the corresponding objective metric and the intelligibility for human listeners. Subjective and objective experiments show that the proposed algorithm is superior to other baseline algorithms while maintaining high real-time performance and fast convergence speed.

SIGNAL MODEL
IDEAL BAND GAINS
NETWORK OPTIMIZATION
EVALUATION INDEXES
COMPARISONS OF DIFFERENT RNNS
CONCLUSION