Multi-Channel Audio Source Separation Using Azimuth-Frequency Analysis and Convolutional Neural Network

Jung Min Moon, Chan Jun Chun, Tae Woo Kim, Jun Ho Kim, Hong Kook Kim

doi:10.1109/icaiic.2019.8668841

Abstract

Because MPEG-H supports object-based as well as channel-based audio content, a sound source separation technique is needed to convert channel-based audio into object-based audio. Among the various sound source separation techniques, azimuth-frequency (AF) analysis has been proposed for this conversion. Unfortunately, it is difficult to set the optimal azimuth and width with this technique. In this paper, we propose a method that determines the optimal azimuth and width using a convolutional neural network (CNN) classifier. First, different sets of audio signals are separated over numerous combinations of azimuth and width. Each separated audio set is then categorized into a specific audio class by the CNN classifier. Finally, to separate a desired audio signal, the azimuth and width that yield the highest similarity for the given class are selected. The performance of the proposed CNN-based method is evaluated in terms of separation accuracy and objective measures: signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). The proposed method provides higher SDR, SIR, SAR, and separation accuracy than both a minimum variance distortionless response (MVDR) beamformer and a method that uses AF analysis alone.
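The selection procedure described in the abstract can be read as a search over candidate (azimuth, width) pairs. The following is a minimal sketch of that reading, not the authors' implementation. The names `select_azimuth_width`, `separate`, and `class_scores` are hypothetical: `separate` stands in for AF-based separation at a given azimuth and width, and `class_scores` stands in for the CNN classifier, assumed here to return one score per audio class. Treating "similarity" as the classifier score for the target class is also an assumption.

```python
from itertools import product
from typing import Any, Callable, Iterable, Sequence, Tuple


def select_azimuth_width(
    mixture: Any,
    azimuths: Iterable[float],
    widths: Iterable[float],
    separate: Callable[[Any, float, float], Any],
    class_scores: Callable[[Any], Sequence[float]],
    target_class: int,
) -> Tuple[float, float, float]:
    """Return the (azimuth, width, score) whose separated signal best matches target_class.

    `separate` and `class_scores` are placeholders (assumptions) for the
    AF-based separator and the CNN classifier; neither is specified here.
    """
    best = None
    # Try every combination of candidate azimuth and width.
    for azimuth, width in product(list(azimuths), list(widths)):
        candidate = separate(mixture, azimuth, width)
        # Assumed: the classifier returns one score per class.
        score = class_scores(candidate)[target_class]
        if best is None or score > best[2]:
            best = (azimuth, width, score)
    if best is None:
        raise ValueError("azimuths and widths must be non-empty")
    return best
```

In practice `separate` would wrap the AF analysis and `class_scores` would wrap a trained network; both are injected as callables so the search loop stays independent of how either one is implemented.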

Full Text