There is growing interest in new audio formats in the context of virtual reality (VR), and higher-order ambisonics (HOA) is preferred in VR systems for transmitting recorded scenes owing to its transmission efficiency and its flexibility to work with different loudspeaker setups. However, conversion between the HOA format and another well-known format, the object format, has not been fully addressed in the literature. To address this issue, blind source separation in the spherical harmonic (SH) domain is an efficient way to extract objects, since decoding the HOA signals prior to separation can be omitted. A few authors have attempted to extract objects directly from encoded HOA signals using multichannel non-negative matrix factorization (MNMF), but these approaches either assume only far-field sources or do not take array characteristics into account, which makes them difficult to use for VR in practical situations where singers or speakers often perform close to the microphones. Furthermore, MNMF generally incurs a high computational cost, even when dimensionality reduction to the SH domain is performed. In this work, we additionally model near-field sources by estimating the parameters of a non-negative tensor factorization (NTF) model in the SH domain, assuming that the microphone signals are obtained with a rigid spherical array. We propose a masking scheme that excludes noisy evanescent regions in the SH domain from the NTF cost function. Evaluations show that our method outperforms existing methods devised for the HOA format and that our masking approach is effective in improving separation quality.
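The abstract does not detail the proposed model, but the core masking idea, excluding unreliable entries from a factorization cost, can be illustrated with a minimal sketch. The example below uses a simple masked (weighted) Euclidean NMF with multiplicative updates in NumPy; the rank, cost function, and all variable names are illustrative assumptions, not the authors' SH-domain NTF model.

```python
import numpy as np

def masked_nmf(V, M, rank=4, n_iter=200, eps=1e-9, seed=0):
    """Weighted (masked) NMF via multiplicative updates.

    Minimizes ||M * (V - W @ H)||_F^2, where M is a binary mask:
    entries with M == 0 (e.g., noisy evanescent regions) are simply
    excluded from the cost. Illustrative stand-in for the paper's
    masked NTF cost, not the authors' actual model.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # non-negative basis
    H = rng.random((rank, T)) + eps   # non-negative activations
    for _ in range(n_iter):
        WH = W @ H
        # Masked entries contribute zero to both numerator and denominator.
        W *= ((M * V) @ H.T) / ((M * WH) @ H.T + eps)
        WH = W @ H
        H *= (W.T @ (M * V)) / (W.T @ (M * WH) + eps)
    return W, H
```

A usage sketch: build a non-negative observation matrix, mask out a random subset of entries, and factorize; the masked reconstruction error decreases while the masked-out entries never influence the updates.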