Abstract

Mask-based multichannel speech enhancement methods based on artificial neural networks estimate a mask that is applied to the multichannel input signal or to a reference channel to obtain an estimate of the desired signal. For this estimation, both spectral and spatial cues from the multichannel input can be used; however, how the two interact inside the neural network is typically unknown. In this contribution, we propose a framework to analyze neural spatiospectral filters (NSSFs) with respect to their capability to extract and represent spatial information. We explicitly take the characteristics of the training target signal into account and analyze its effect on the functionality of the NSSF. Using two conceptually different NSSFs as examples, we show that not all NSSFs use spatial information under all circumstances and that the training target signal has a significant influence on the spatial filtering behavior of an NSSF. These insights help to assess the signal processing capabilities of neural networks and allow informed decisions when configuring, training, and deploying NSSFs.
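The mask application described in the first sentence can be illustrated with a minimal sketch. This is not the authors' implementation; the shapes, the random mask, and the choice of channel 0 as reference are assumptions for illustration only. A real-valued time-frequency mask (as an NSSF might estimate) is multiplied elementwise with either one reference channel or all channels of a multichannel STFT:

```python
import numpy as np

# Hypothetical dimensions: C microphone channels, F frequency bins, T frames.
rng = np.random.default_rng(0)
C, F, T = 4, 257, 50

# Stand-in for a multichannel STFT input (complex-valued).
X = rng.standard_normal((C, F, T)) + 1j * rng.standard_normal((C, F, T))

# Stand-in for a mask in [0, 1]; in practice this would come from the
# neural spatiospectral filter (NSSF), not from a random generator.
mask = rng.uniform(0.0, 1.0, size=(F, T))

# Variant 1: apply the mask to a single reference channel (here channel 0).
S_ref = mask * X[0]

# Variant 2: apply the same mask to every channel of the input.
S_multi = mask[np.newaxis, :, :] * X

print(S_ref.shape)    # (F, T)
print(S_multi.shape)  # (C, F, T)
```

The estimated desired signal would then be obtained by an inverse STFT of the masked spectrogram; which variant is used, and which cues the network exploited to produce the mask, is exactly what the proposed analysis framework addresses.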