Single-channel speech enhancement has been studied extensively in the literature, for applications such as speech communication systems. However, in emerging applications such as virtual reality and spatial audio, in addition to attenuating undesired signals, it is important to preserve the spatial information of the desired signal captured in a noisy environment. Nevertheless, only a few studies in the literature propose solutions to this challenge. Most of these attenuate the undesired signals while preserving only limited spatial information about the desired signal, such as its direction of arrival (DOA). Methods that preserve complete spatial information have been suggested only recently and have not been studied comprehensively. In this paper, two such methods based on time-frequency masking are investigated, with the aim of attenuating the undesired signal while preserving the spatial components of the desired signal. The first, referred to as spatial masking, is based on masking in the plane wave density (PWD) domain; the second is based on masking in the spherical harmonics (SH) domain. Both methods are compared with a reference method based on beamforming followed by single-channel time-frequency masking. Objective analysis and two listening tests were conducted to evaluate the performance of these methods for speech enhancement. The results show that the spatial masking based method better preserves the desired component of the sound field, whereas the performance of the SH based method depends more strongly on the sources' distances. Conversely, the SH based method better preserves the DOA of the residual noise, while under the spatial masking based method the DOA of the residual noise is strongly affected by the undesired signal.
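To make the SH-domain masking idea concrete, the following is a minimal sketch, not the authors' implementation: a single real-valued time-frequency mask is applied identically to every SH coefficient channel of the STFT, so the relative spatial encoding of the retained components is preserved. The helper name `apply_tf_mask_sh` and the Wiener-style oracle mask built from synthetic magnitudes are assumptions for illustration only; the paper's own mask estimators are not specified here.

```python
import numpy as np

def apply_tf_mask_sh(sh_stft, mask):
    """Apply one (freq, time) mask to all SH channels (hypothetical helper).

    sh_stft : complex array, shape (channels, freq, time)
        STFT of the spherical-harmonics coefficient signals.
    mask : real array, shape (freq, time), values in [0, 1]
        Time-frequency gain shared across channels, so the ratios
        between SH channels (i.e., the spatial encoding) are unchanged.
    """
    return sh_stft * mask[np.newaxis, :, :]

# Toy data: 4 first-order SH channels, 128 frequency bins, 50 frames.
rng = np.random.default_rng(0)
sh_stft = (rng.standard_normal((4, 128, 50))
           + 1j * rng.standard_normal((4, 128, 50)))

# Generic Wiener-style mask from assumed desired/noise magnitudes
# (an illustrative oracle, not the estimator used in the paper).
desired_mag = np.abs(rng.standard_normal((128, 50)))
noise_mag = np.abs(rng.standard_normal((128, 50)))
mask = desired_mag**2 / (desired_mag**2 + noise_mag**2 + 1e-12)

enhanced = apply_tf_mask_sh(sh_stft, mask)
print(enhanced.shape)  # same (channels, freq, time) layout as the input
```

The PWD-domain variant described in the abstract differs in where the mask is applied: the sound field is first transformed to a grid of plane-wave directions, masked there, and transformed back, which allows direction-dependent gains rather than one shared gain per time-frequency bin.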