Abstract

Speech separation is a core problem in audio signal processing and a key pre-processing step for automatic speech recognition. The magnitude spectrogram is the standard time-frequency representation for speech signals, and approaches such as time-frequency (T-F) masking and mapping estimation have been proposed to estimate clean speech from it. Recently, this task has made great progress owing to time-domain mask estimation and the dilated temporal convolutional networks (TCNs) used in Conv-TasNet. In this work, we propose a framework that properly integrates the directions above, resulting in a new monaural speech separation framework. We explore a time-domain mapping-based algorithm that directly estimates clean speech features in an end-to-end system, and we employ an optimal scale-invariant signal-to-distortion ratio (OSI-SDR) loss function. We evaluate this framework on a newly released noisy speech separation dataset (WHAM) and obtain encouraging results in preliminary experiments. Finally, we show that a 1-D learned convolutional encoder extracts features more effectively than the alternative encoders we compared against.
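The abstract does not spell out the OSI-SDR objective. For orientation, here is a minimal NumPy sketch of the standard SI-SDR metric that it builds on; the function name si_sdr and the eps stabilizer are illustrative choices, not taken from the paper. Training a time-domain separation network typically minimizes the negative of this quantity.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    Illustrative sketch: both signals are zero-meaned so the metric
    ignores DC offsets, and the target is rescaled by the projection
    coefficient so the metric is invariant to overall gain.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to isolate the "clean" component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    )

# As a loss, one would minimize the negative SI-SDR, e.g.:
#   loss = -si_sdr(separated_waveform, clean_waveform)
```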
