Abstract

In this paper, we propose a multi-channel cross-tower with attention mechanisms in latent domain network (Multi-TALK) that suppresses both acoustic echo and background noise. The proposed approach consists of a cross-tower network, a parallel encoder with an auxiliary encoder, and a decoder. For multi-channel processing, the parallel encoder extracts latent features from each microphone, and these latent features, which include the spatial information, are compressed by a 1D convolution operation. In addition, the latent features of the far-end signal are extracted by the auxiliary encoder and provided to the cross-tower network through an attention mechanism. The cross-tower network iteratively estimates the latent features of the acoustic echo and the background noise in each tower. To improve the performance at each iteration, the output of each tower is passed as input to the next iteration of the neighboring tower. To estimate the near-end speech, attention mechanisms are further applied before the decoder to remove the estimated acoustic echo and background noise from the compressed mixture, which prevents speech distortion caused by over-suppression. Compared with conventional algorithms, the proposed algorithm effectively suppresses the acoustic echo and background noise and significantly lowers the speech distortion.
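The iterative exchange between towers can be illustrated with a toy data-flow sketch. This is not the Multi-TALK architecture: it assumes one tower per interference type (echo and noise), replaces each tower with a single random tanh layer, and omits the encoders, the far-end attention, and the decoder. The names `make_tower`, `cross_tower`, and `num_iters` are hypothetical.

```python
import numpy as np

def make_tower(feat_dim, seed):
    """Hypothetical stand-in for one tower: a single tanh layer that maps
    [compressed mixture, neighbor estimate] to a new latent estimate."""
    rng = np.random.default_rng(seed)
    weights = 0.1 * rng.standard_normal((2 * feat_dim, feat_dim))

    def tower(mixture, neighbor_estimate):
        stacked = np.concatenate([mixture, neighbor_estimate], axis=-1)
        return np.tanh(stacked @ weights)

    return tower

def cross_tower(mixture, echo_tower, noise_tower, num_iters=3):
    """Run both towers for num_iters iterations, exchanging estimates."""
    echo = np.zeros_like(mixture)
    noise = np.zeros_like(mixture)
    for _ in range(num_iters):
        # The right-hand side is evaluated first, so each tower receives the
        # neighboring tower's estimate from the previous iteration.
        echo, noise = echo_tower(mixture, noise), noise_tower(mixture, echo)
    return echo, noise

# Toy latent mixture: 50 frames with 16 features each (arbitrary sizes).
mixture = np.random.default_rng(0).standard_normal((50, 16))
echo_est, noise_est = cross_tower(mixture, make_tower(16, 1), make_tower(16, 2))
```

In this sketch, both estimates keep the shape of the compressed mixture, so a later stage could compare them with the mixture frame by frame.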

Highlights

  • As the demand for smart devices operated by speech commands continues to increase, the coupling between the loudspeaker and microphone in a smart device under ambient noise significantly degrades the quality of speech communication and the performance of automatic speech recognition

  • We propose the Multi-TALK algorithm, which simultaneously suppresses acoustic echo and background noise

  • Rather than estimating near-end speech directly from the mixture, Multi-TALK is designed to use a cross-tower for estimating the echo and noise to be removed and uses these estimates for near-end speech estimation

Summary

Introduction

As the demand for smart devices operated by speech commands continues to increase, the coupling between the loudspeaker and microphone in a smart device under ambient noise significantly degrades the quality of speech communication and the performance of automatic speech recognition. The traditional approach to acoustic echo cancellation (AEC) is to estimate the acoustic echo path from the loudspeaker to the microphone using an adaptive filter [1]. For these AEC methods to work properly, several additional issues must be resolved. To prevent the adaptive filter from diverging during double-talk, a separate double-talk detector has been used [4,5] to stop the filter from updating whenever double-talk occurs. The problem becomes more challenging when both echo and noise are present, because a noise suppression module is then also required.
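The traditional scheme described above can be sketched as follows. This is a minimal illustration, not the method of the cited works: it assumes a normalized LMS (NLMS) update as the adaptive filter, and it takes the double-talk decision as a precomputed boolean mask rather than implementing a detector. The function name `nlms_echo_canceller` and the parameters `num_taps`, `step`, `eps`, and `freeze` are hypothetical.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=64, step=0.5, eps=1e-8, freeze=None):
    """Estimate the echo path with NLMS and subtract the predicted echo.

    freeze[n] == True pauses adaptation at sample n, which is the role a
    double-talk detector would play (the detector itself is not shown).
    """
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    if freeze is None:
        freeze = np.zeros(len(mic), dtype=bool)
    weights = np.zeros(num_taps)   # current echo-path estimate
    buffer = np.zeros(num_taps)    # recent far-end samples, newest first
    error = np.zeros(len(mic))     # echo-cancelled output
    for n in range(len(mic)):
        buffer = np.roll(buffer, 1)
        buffer[0] = far_end[n]
        error[n] = mic[n] - weights @ buffer
        if not freeze[n]:
            # Normalized LMS update; eps guards against division by zero.
            weights += step * error[n] * buffer / (buffer @ buffer + eps)
    return error, weights

# Toy check: white-noise far-end signal through a short, known echo path.
rng = np.random.default_rng(0)
far_end = rng.standard_normal(4000)
echo_path = np.array([0.5, -0.3, 0.2, 0.1])
mic = np.convolve(far_end, echo_path)[:len(far_end)]
residual, estimate = nlms_echo_canceller(far_end, mic, num_taps=8)
```

In this noise-free toy setup, the first four taps of `estimate` approach `echo_path` and `residual` decays toward zero. With near-end speech or background noise in `mic`, the update is disturbed, which is why the mask and a separate noise suppression stage are needed in the traditional pipeline.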

