Abstract

Target sound extraction is the task of extracting only a desired sound signal from a mixture of different sounds, using a clue given by a target class label or a target signal similar to the desired sound. Currently available network architectures for this task are designed to handle only dry sounds. In this work, we introduce a transformer-based target sound extraction model that can extract reverberant sounds. To separate reverberant sound mixtures, we begin with the Dense Frequency-Time Attentive Network (DeFT-AN) architecture developed for speech enhancement tasks, which generates the complex short-time Fourier transform (STFT) mask of clean speech from a noisy reverberant mixture to suppress noise. To make DeFT-AN compatible with the target sound extraction task, we modify its architecture such that the embedding vector for the target class label can be fused in the middle of the sequentially connected DeFT-A blocks constituting DeFT-AN. We demonstrate that the transformer-based speech enhancement model can be successfully converted into a target sound extraction model and outperforms state-of-the-art extraction models in tests carried out with reverberant mixtures.
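
The abstract describes conditioning the network by fusing a target-class embedding between sequentially connected DeFT-A blocks. The following is a minimal sketch of that idea only; the internals of a DeFT-A block, the fusion position, and the fusion operation (multiplicative gating is used here as one common choice) are not specified in the abstract, so `PlaceholderDeFTABlock`, `LabelConditionedExtractor`, and `fuse_at` are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class PlaceholderDeFTABlock(nn.Module):
    """Stand-in for a DeFT-A block; the real block combines dense
    convolutional and frequency/time attention layers (see the paper)."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time, freq)
        return x + self.proj(x)


class LabelConditionedExtractor(nn.Module):
    """Sketch: fuse a target-class embedding midway through a block stack."""
    def __init__(self, channels=64, num_blocks=6, num_classes=20, fuse_at=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            PlaceholderDeFTABlock(channels) for _ in range(num_blocks)
        )
        # Target class label -> embedding of the same channel dimension.
        self.embed = nn.Embedding(num_classes, channels)
        self.fuse_at = fuse_at  # fusion index in the middle of the stack (assumption)

    def forward(self, spec, label):  # spec: (B, C, T, F), label: (B,)
        e = self.embed(label)[:, :, None, None]  # (B, C, 1, 1), broadcast over time/freq
        for i, block in enumerate(self.blocks):
            if i == self.fuse_at:
                spec = spec * e  # multiplicative fusion (one possible choice)
            spec = block(spec)
        return spec  # would feed a complex-STFT mask estimation head
```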
