Abstract

This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation. Originating in computational auditory scene analysis (CASA), this approach treats speech separation as a mask estimation problem. Given a time-frequency (T-F) representation of noisy speech, the ideal binary mask (IBM) or ideal ratio mask (IRM) is defined to differentiate speech-dominant T-F units from noise-dominant ones. Mask estimation is then formulated as a supervised learning problem: a mapping function is learned from acoustic features extracted from noisy speech to an ideal mask. Three main aspects of supervised learning, namely learning machines, training targets, and features, are discussed in separate sections. Subsequently, we describe several representative supervised algorithms, mainly for monaural speech separation. For supervised separation, generalization to unseen conditions is a critical issue, and the generalization capability of supervised speech separation is also discussed.
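As a minimal illustration of the two ideal-mask definitions mentioned above (not code from the chapter), the IBM and IRM can be computed when the speech and noise T-F magnitudes are known separately, as they are during supervised training. The array shapes, the synthetic magnitudes, and the 0 dB local-SNR threshold for the IBM are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical T-F magnitude spectrograms (frequency bins x time frames)
# standing in for the clean-speech and noise components of a mixture.
speech_mag = rng.rayleigh(scale=2.0, size=(64, 100))
noise_mag = rng.rayleigh(scale=1.0, size=(64, 100))

# Ideal binary mask: 1 in T-F units where the local SNR exceeds a
# threshold (0 dB assumed here), marking them speech-dominant.
local_snr_db = 20.0 * np.log10(speech_mag / noise_mag)
ibm = (local_snr_db > 0.0).astype(np.float32)

# Ideal ratio mask: a soft speech-to-mixture energy ratio in [0, 1],
# approaching 1 where speech dominates and 0 where noise dominates.
irm = np.sqrt(speech_mag**2 / (speech_mag**2 + noise_mag**2))

print(ibm.shape, irm.shape)  # both (64, 100)
```

In a supervised system, either mask serves as the training target; at test time, the estimated mask is applied to the noisy T-F representation to recover the speech.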
