Abstract

The development of deep learning techniques has triggered the active investigation of neural network-based speech enhancement approaches. In particular, single-channel blind (uninformed) speech separation and speaker-aware (informed) speech extraction have received increased interest. Blind speech separation separates a speech mixture into all source signals without requiring any auxiliary information about the speakers. In contrast, speaker-aware speech extraction focuses on extracting speech from a target speaker using prior knowledge, such as an utterance spoken by the target speaker. Speaker extraction is therefore not fully blind, but it can mitigate the source permutation problem faced by blind source separation, and potentially achieve better speech quality by exploiting the auxiliary information. In this paper, to take advantage of both approaches, we propose a unified framework for both speech separation and speech extraction using a single model. This is realized by incorporating a speaker attention mechanism within a generalized permutation invariant training (PIT)-based blind speech separation model, and introducing a multitask separation/extraction objective for training the model. Experiments on the WSJ0-2mix dataset show that our proposed framework realizes both uninformed separation and informed extraction, and achieves better separation/extraction performance than a baseline PIT-based model.
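The abstract describes the training objective only at a high level (PIT-based separation combined with an extraction objective). The snippet below is a minimal, hedged sketch of how such a multitask loss could be combined, not the authors' implementation: the use of a mean-squared-error criterion, the function names, and the weighting factor `alpha` are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of a multitask objective:
# an utterance-level PIT loss for blind 2-speaker separation combined
# with an extraction loss for the target speaker. The MSE criterion
# and the weight `alpha` are illustrative assumptions.
from itertools import permutations
import torch


def pit_loss(estimates, references):
    """Permutation-invariant loss over speakers.

    estimates, references: tensors of shape (batch, num_speakers, time)
    """
    num_spk = references.shape[1]
    per_perm_losses = []
    for perm in permutations(range(num_spk)):
        # Mean-squared error for this speaker-to-reference assignment.
        perm_loss = torch.stack(
            [torch.mean((estimates[:, i] - references[:, p]) ** 2, dim=-1)
             for i, p in enumerate(perm)]
        ).mean(dim=0)
        per_perm_losses.append(perm_loss)
    # Keep the best permutation per utterance, then average over the batch.
    return torch.stack(per_perm_losses, dim=0).min(dim=0).values.mean()


def extraction_loss(estimate, reference):
    """Loss for the informed (speaker-aware) extraction branch."""
    return torch.mean((estimate - reference) ** 2)


def multitask_loss(sep_est, sep_ref, ext_est, ext_ref, alpha=0.5):
    """Weighted sum of the separation (PIT) and extraction objectives."""
    return (alpha * pit_loss(sep_est, sep_ref)
            + (1 - alpha) * extraction_loss(ext_est, ext_ref))


if __name__ == "__main__":
    # Toy example: batch of 4 mixtures, 2 speakers, 1-second signals at 16 kHz.
    sep_est = torch.randn(4, 2, 16000)
    sep_ref = torch.randn(4, 2, 16000)
    ext_est = torch.randn(4, 16000)
    ext_ref = torch.randn(4, 16000)
    print(multitask_loss(sep_est, sep_ref, ext_est, ext_ref))
```

In practice, the separation branch would be trained blindly with the PIT term, while the extraction branch would be conditioned on a speaker embedding (e.g., via the speaker attention mechanism mentioned above) and trained with the extraction term; the sketch only shows how the two losses could be combined.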
