Abstract
Separating speech mixtures in noisy and reverberant environments remains a challenging task for state-of-the-art speech separation systems. Time-domain audio separation networks (TasNets) are among the most commonly used architectures for this task and have demonstrated strong performance on typical speech separation benchmarks where the speech is not contaminated with noise; when additive or convolutive noise is present, however, separation performance degrades significantly. TasNets are typically composed of an encoder network, a mask estimation network and a decoder network. Without any pre-processing of the input data or post-processing of the separation network output, this design places most of the burden of enhancing the signal on the mask estimation network. This work proposes multi-head attention (MHA) as an additional layer in the encoder and decoder to help the separation network attend to encoded features that are relevant to the target speakers and, conversely, suppress noisy disturbances in the encoded features. Incorporating MHA into the encoder network in particular leads to a consistent performance improvement across numerous quality and intelligibility metrics under a variety of acoustic conditions on the WHAMR corpus, a dataset of noisy reverberant speech mixtures. The use of MHA is also investigated in the decoder network, where smaller but consistent performance improvements are observed for specific model configurations. The best performing MHA models yield a mean 0.6 dB scale-invariant signal-to-distortion ratio (SISDR) improvement on noisy reverberant mixtures over a baseline 1D convolutional encoder, and a mean 1 dB SISDR improvement on clean speech mixtures.
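To make the proposed architecture concrete, the sketch below shows one plausible way an MHA layer could augment a TasNet-style convolutional encoder, as described in the abstract. This is a minimal illustration in PyTorch, not the authors' implementation: the class name MHAEncoder, the filter/kernel/head sizes, and the residual connection around the attention layer are all assumptions.

```python
import torch
import torch.nn as nn

class MHAEncoder(nn.Module):
    """Hypothetical sketch: a TasNet-style 1D convolutional encoder
    followed by multi-head self-attention over the encoded frames.
    Layer sizes and attention placement are assumptions, not the
    paper's exact configuration."""

    def __init__(self, n_filters=512, kernel_size=16, stride=8, n_heads=8):
        super().__init__()
        # Standard TasNet encoder: strided 1D convolution over the raw waveform.
        self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        self.relu = nn.ReLU()
        # Self-attention over encoded frames, letting the network emphasise
        # features relevant to the target speakers and suppress noise.
        self.mha = nn.MultiheadAttention(embed_dim=n_filters,
                                         num_heads=n_heads,
                                         batch_first=True)

    def forward(self, wav):
        # wav: (batch, samples) raw waveform
        feats = self.relu(self.conv(wav.unsqueeze(1)))  # (batch, filters, frames)
        frames = feats.transpose(1, 2)                  # (batch, frames, filters)
        attended, _ = self.mha(frames, frames, frames)  # self-attention
        # Residual connection (an assumption); return (batch, filters, frames).
        return (frames + attended).transpose(1, 2)

if __name__ == "__main__":
    enc = MHAEncoder()
    mix = torch.randn(2, 32000)  # two 2-second mixtures at 16 kHz
    print(enc(mix).shape)        # torch.Size([2, 512, 3999])
```

The same idea could be mirrored in the decoder by applying attention to the masked features before the transposed convolution that reconstructs the waveform, which is where the abstract reports smaller but consistent gains.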